1 Introduction

In this homework, we examine sales data from Trendyol for 9 different products. The data include products from various categories, and each product is examined separately. For each product, we try to identify the pattern that the sales follow and examine the general trend. We use this pattern to decompose the data, and then fit AR, MA, and combined ARMA models to determine the best-fitting prediction method. Such sales-based predictions are important for companies planning their pricing or inventories; this is a simplified version of the work that many firms carry out in real time. Some of the regressors in the data, such as overall visits to the Trendyol website, were not accurate, so we avoid including such variables in our models.

2 Xiaomi

library(data.table)
library(ggplot2)
library(skimr)
library(ggcorrplot)
library(GGally)
library(forecast)
library(lubridate)
library(urca)
load("C:/Users/Asus/Desktop/Spring 2021/IE 360/hw4-5/homework 4-5.RData")

2.1 Decomposition

In this part of the analysis, the series are decomposed at different levels.

tsdisplay(ts(xiaomi$sold_count))

2.1.1 Daily Seasonality

First, the series is plotted to detect possible seasonal periods. As can be seen from the autocorrelation plot below, the autocorrelation is highest at lag 32.

acf(xiaomi$sold_count, lag.max = 50)

The data is first decomposed at the weekly level with frequency 7; the results are displayed below. Additive decomposition is chosen since the variance of the series is not affected by the level of the time series. The random component is stationary, with a mean close to zero and constant variance. The KPSS test also indicates that the random component is stationary.
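The decomposition object `xiaomi_ts_week_dec` is loaded from the .RData file; a minimal sketch of how such an object is typically built with base R's `decompose()`, shown on a synthetic daily series standing in for `xiaomi$sold_count`:

```r
set.seed(1)
# synthetic daily sales standing in for xiaomi$sold_count
sold <- 100 + 20 * sin(2 * pi * (1:210) / 7) + rnorm(210, sd = 5)

sold_ts     <- ts(sold, frequency = 7)               # frequency 7 = weekly seasonality
sold_dec    <- decompose(sold_ts, type = "additive") # additive classical decomposition
sold_random <- sold_dec$random                       # remainder, later modeled with ARIMA

round(mean(sold_random, na.rm = TRUE), 2)            # remainder fluctuates around zero
```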

plot(xiaomi_ts_week_dec)

summary(ur.kpss(xiaomi_ts_week_dec$random))
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0112 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

2.1.2 Decomposition with frequency = 32

Second, the data is decomposed with frequency 32; the results are displayed below. Additive decomposition is again chosen since the variance of the series is not affected by the level of the time series. The random component is stationary, with a mean close to zero and constant variance, and the KPSS test also indicates stationarity.

plot(xiaomi_ts_year_dec)

summary(ur.kpss(xiaomi_ts_year_dec$random))
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0131 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Since the random component from the first decomposition looks more stationary and its KPSS statistic is lower, it is chosen for model construction. (In both cases the KPSS statistic is far below the 10% critical value of 0.347, so the null hypothesis of stationarity cannot be rejected.)

2.2 Arima Model

The ACF and PACF plots indicate moving-average behavior, since the PACF decays exponentially while the ACF spikes and then becomes insignificant after some point.

acf(xiaomi_random, na.action = na.pass, lag.max = 50)

pacf(xiaomi_random, na.action = na.pass, lag.max = 50)

After trying several alternatives, the model below is chosen, with parameters p = 2 and q = 4. It has the lowest AIC value among the alternatives.
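The "trying several alternatives" step can be sketched as a small grid search over (p, q) that keeps the lowest-AIC fit. This is shown on a simulated stationary series; in the report the search would run on the decomposition remainder:

```r
set.seed(2)
# simulated stationary ARMA series standing in for the decomposition remainder
y <- arima.sim(model = list(ar = 0.4, ma = c(-0.3, 0.2)), n = 300)

best_aic <- Inf
best_fit <- NULL
for (p in 0:2) {
  for (q in 0:4) {
    fit <- try(arima(y, order = c(p, 0, q)), silent = TRUE)
    if (!inherits(fit, "try-error") && AIC(fit) < best_aic) {
      best_aic <- AIC(fit)   # keep the fit with the lowest AIC so far
      best_fit <- fit
    }
  }
}
best_fit$arma[1:2]  # selected (p, q)
```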

summary(m3)
## 
## Call:
## arima(x = xiaomi_random, order = c(2, 0, 4))
## 
## Coefficients:
##           ar1      ar2      ma1      ma2      ma3      ma4  intercept
##       -0.1073  -0.1799  -0.0238  -0.1873  -0.6167  -0.1722    -0.0305
## s.e.      NaN   0.0819      NaN      NaN      NaN      NaN     0.0989
## 
## sigma^2 estimated as 9381:  log likelihood = -2333.65,  aic = 4683.3
## 
## Training set error measures:
##                     ME     RMSE      MAE      MPE     MAPE      MASE
## Training set -1.117634 96.85467 61.68407 89.07912 177.5698 0.6525615
##                      ACF1
## Training set -0.003936123

The residuals of the model satisfy the model assumptions: they are approximately normally distributed with zero mean and roughly constant variance. However, the autocorrelations at lags 4 and 5 are slightly significant, and the Ljung-Box test below rejects joint whiteness of the residuals at the 5% level.

checkresiduals(m3)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,4) with non-zero mean
## Q* = 23.013, df = 7, p-value = 0.001696
## 
## Model df: 7.   Total lags used: 14

Fitted (red) and real (black) values are plotted below.

ggplot(xiaomi, aes(x = event_date))+
  geom_line(aes(y = Model1), color = 'red')+
  geom_line(aes(y = sold_count), color = 'black')

Finally, the results from the decomposition and the ARIMA model are combined. Overall, the residuals satisfy the model assumptions.
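The `Model1` column is assumed to be assembled from the decomposition components plus the ARIMA fit, along these lines (sketched on a synthetic stand-in series; names are illustrative):

```r
set.seed(3)
# synthetic daily sales standing in for the product series
sold <- 100 + 20 * sin(2 * pi * (1:210) / 7) + rnorm(210, sd = 5)
dec  <- decompose(ts(sold, frequency = 7), type = "additive")
fit  <- arima(dec$random, order = c(2, 0, 4))

# fitted remainder = observed remainder minus ARIMA residuals,
# then add back the trend and seasonal components
fitted_random <- dec$random - residuals(fit)
model1 <- as.numeric(dec$trend + dec$seasonal + fitted_random)
```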

ggplot(xiaomi, aes(x = event_date))+
  geom_line(aes(y = sold_count - Model1))
## Warning: Removed 6 row(s) containing missing values (geom_path).

acf(xiaomi$sold_count-xiaomi$Model1, na.action = na.pass)

2.3 Additional Regressors

ggpairs(xiaomi[,c(4,18:20)])

# plot(xiaomi$price) # ok
# plot(xiaomi$visit_count) 
# plot(xiaomi$basket_count) # ok
# plot(xiaomi$favored_count)
# plot(xiaomi$category_sold) # not ok unusual behavior
# plot(xiaomi$category_basket)
# plot(xiaomi$category_visits)
# plot(xiaomi$category_favored) # ok
# plot(xiaomi$category_brand_sold)

Only price with lag 1, basket count with lag 2, and category favored with lag 2 can be considered as potential regressors, since the rest of the variables are either missing or dirty.

Among these alternatives, price with lag 1 and basket count with lag 2 are highly correlated with the sales data.
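The lagged versions of these variables can be created with data.table's `shift()`, e.g. (a sketch with illustrative values; column names follow the report's convention):

```r
library(data.table)

# small illustrative table standing in for the product data
dt <- data.table(price = c(10, 11, 12, 13), basket_count = c(5, 7, 6, 8))
dt[, price_lag1        := shift(price, 1)]         # lag 1: NA, 10, 11, 12
dt[, basket_count_lag2 := shift(basket_count, 2)]  # lag 2: NA, NA, 5, 7
```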

2.4 Arima with Regressors

At this stage of the analysis, the external regressors determined above are added to the model. The results are as follows.

summary(m2)
## 
## Call:
## arima(x = xiaomi_random, order = c(3, 0, 2), xreg = matrix(c(xiaomi$price_lag1, 
##     xiaomi$basket_count_lag2), byrow = FALSE, ncol = 2))
## 
## Coefficients:
##          ar1     ar2      ar3      ma1      ma2  intercept
##       0.3538  0.1219  -0.3731  -0.5472  -0.4528   -11.1106
## s.e.  0.1637  0.1367   0.0695   0.1723   0.1722     2.5258
##       matrix(c(xiaomi$price_lag1, xiaomi$basket_count_lag2), byrow = FALSE, ncol = 2)1
##                                                                                 0.0381
## s.e.                                                                            0.0188
##       matrix(c(xiaomi$price_lag1, xiaomi$basket_count_lag2), byrow = FALSE, ncol = 2)2
##                                                                                 0.0038
## s.e.                                                                               NaN
## 
## sigma^2 estimated as 9041:  log likelihood = -2326.78,  aic = 4671.55
## 
## Training set error measures:
##                   ME     RMSE     MAE     MPE    MAPE      MASE         ACF1
## Training set -2.0912 95.08518 61.9725 48.4812 217.848 0.6556128 -0.001766772
checkresiduals(m2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,0,2) with non-zero mean
## Q* = 19.462, df = 6, p-value = 0.003451
## 
## Model df: 8.   Total lags used: 14

The ARIMA model largely satisfies the model assumptions. There is a slightly significant autocorrelation at lag 6, but it is not highly significant.

Finally, the decomposition and ARIMA values are combined. The fitted and real values are plotted below.

ggplot(xiaomi, aes(x = event_date))+
  geom_line(aes(y = Model2), color = 'red')+
  geom_line(aes(y = sold_count), color = 'black')

The model assumptions are satisfied: the residuals are distributed with mean zero and constant variance and show no significant autocorrelation.

ggplot(xiaomi, aes(x = event_date))+
  geom_line(aes(y = sold_count - Model2))

acf(xiaomi$sold_count-xiaomi$Model2, na.action = na.pass)

2.5 Evaluation

frame_xiaomi
##   n     mean       sd        CV      FBias      MAPE     RMSE      MAD
## 1 7 300.1429 53.28986 0.1775483 -0.3820318 0.4141282 122.3179 114.6641
## 2 7 300.1429 53.28986 0.1775483 -0.3386647 0.3727282 110.1031 101.6478
##        MADP     WMAPE    WMAPE2
## 1 0.3820318 0.3820318 0.3820318
## 2 0.3386647 0.3386647 0.3386647

Finally, the two models are tested over a period of one week. The second model, the one with external regressors, performs better as it has a lower WMAPE value.
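The WMAPE reported in `frame_xiaomi` reduces to the sum of absolute errors over the sum of actuals (the `accu()` helper later in the report computes it this way); a minimal sketch:

```r
# weighted MAPE: absolute errors weighted by the actual values
wmape <- function(actual, forecast) {
  sum(abs(actual - forecast)) / sum(actual)
}

wmape(c(100, 120, 90), c(110, 100, 95))  # (10 + 20 + 5) / 310, about 0.113
```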

3 Fakir

tsdisplay(ts(fakir$sold_count))

3.1 Decomposition

3.1.1 Daily Seasonality

First, the series is plotted to detect possible seasonal periods. As can be seen from the autocorrelation plot below, the autocorrelation is highest at lag 16.

acf(fakir$sold_count, lag.max = 50)

The data is first decomposed at the weekly level with frequency 7; the results are displayed below. Additive decomposition is chosen since the variance of the series is not affected by the level of the time series. The random component is stationary, with a mean close to zero and roughly constant variance, although there is a slight increase in variance toward the middle of the series. The KPSS test also indicates that the random component is stationary.

plot(fakir_ts_week_dec)

summary(ur.kpss(fakir_ts_week_dec$random))
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0063 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

3.1.2 Decomposition with frequency = 16

Second, the data is decomposed with frequency 16; the results are displayed below. The random component is stationary, with a mean close to zero and constant variance, and the KPSS test also indicates stationarity.

Since the random component from the first decomposition looks more stationary and its KPSS statistic is lower, it is chosen for model construction.

plot(fakir_ts_year_dec)

summary(ur.kpss(fakir_ts_year_dec$random))
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0209 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

3.2 Arima Model

The ACF and PACF plots indicate moving-average behavior, since the PACF decays exponentially while the ACF spikes and then becomes insignificant after some point. However, there is a slight increase in autocorrelation at lag 16.

acf(fakir_random, na.action = na.pass, lag.max = 50)

pacf(fakir_random, na.action = na.pass, lag.max = 50)

After trying several alternatives, the model below is chosen, with parameter q = 4. It has the lowest AIC value among the alternatives. This order is also suggested by the ACF plot, where the autocorrelation decreases considerably after lag 4.

summary(m1_fakir)
## 
## Call:
## arima(x = fakir_random, order = c(0, 0, 4), include.mean = FALSE)
## 
## Coefficients:
##           ma1      ma2      ma3      ma4
##       -0.0838  -0.1935  -0.4385  -0.2842
## s.e.   0.0473   0.0446   0.0423   0.0508
## 
## sigma^2 estimated as 259.2:  log likelihood = -1635.42,  aic = 3280.84
## 
## Training set error measures:
##                     ME     RMSE      MAE      MPE     MAPE      MASE       ACF1
## Training set 0.1586474 16.10023 8.611572 67.98593 166.7596 0.6871277 0.01105164

The residuals of the model satisfy the model assumptions: they are approximately normally distributed with zero mean and roughly constant variance. However, the autocorrelation at lag 16 is slightly significant, and the distribution of the residuals is more concentrated around zero than a normal distribution.

checkresiduals(m1_fakir)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,4) with zero mean
## Q* = 30.738, df = 10, p-value = 0.000648
## 
## Model df: 4.   Total lags used: 14

Fitted (red) and real (black) values are plotted below.

ggplot(fakir, aes(x = event_date))+
  geom_line(aes(y = Model1), color = 'red')+
  geom_line(aes(y = sold_count), color = 'black')

Finally, the results from the decomposition and ARIMA model are combined. Overall, the residuals mostly satisfy the model assumptions. The variance of the series increases toward the middle of the series, and the autocorrelation is not significant.

ggplot(fakir, aes(x = event_date))+
  geom_line(aes(y = sold_count - Model1))

acf(fakir$sold_count-fakir$Model1, na.action = na.pass)

3.3 Additional Regressors

ggpairs(fakir[,c(4,18:21)])

# plot(fakir$price) # ok
# plot(fakir$visit_count) 
# plot(fakir$basket_count) # ok
# plot(fakir$favored_count)
# plot(fakir$category_sold) # ok
# plot(fakir$category_basket)
# plot(fakir$category_visits)
# plot(fakir$category_favored) # ok
# plot(fakir$category_brand_sold)

Only price with lag 1, basket count with lag 2, category sold with lag 2, and category favored with lag 2 can be considered as potential regressors, since the rest of the variables are either missing or dirty.

Among these alternatives, all variables except category sold with lag 2 are highly correlated with the sales data.

3.4 Arima with Regressors

At this stage of the analysis, the external regressors determined above are added to the model. The results are as follows.

summary(m2_fakir)
## 
## Call:
## arima(x = fakir_random, order = c(3, 0, 1), xreg = matrix(c(fakir$price_lag1, 
##     fakir$basket_count_lag2, fakir$category_favored_lag2), byrow = FALSE, ncol = 3), 
##     include.mean = FALSE)
## 
## Coefficients:
##          ar1      ar2      ar3      ma1
##       0.8565  -0.1345  -0.2929  -1.0000
## s.e.  0.0484   0.0646   0.0484   0.0067
##       matrix(c(fakir$price_lag1, fakir$basket_count_lag2, fakir$category_favored_lag2), byrow = FALSE, ncol = 3)1
##                                                                                                            -8e-04
## s.e.                                                                                                        1e-04
##       matrix(c(fakir$price_lag1, fakir$basket_count_lag2, fakir$category_favored_lag2), byrow = FALSE, ncol = 3)2
##                                                                                                             6e-04
## s.e.                                                                                                        4e-04
##       matrix(c(fakir$price_lag1, fakir$basket_count_lag2, fakir$category_favored_lag2), byrow = FALSE, ncol = 3)3
##                                                                                                                 0
## s.e.                                                                                                          NaN
## 
## sigma^2 estimated as 241.1:  log likelihood = -1621.87,  aic = 3259.75
## 
## Training set error measures:
##                     ME    RMSE      MAE      MPE     MAPE      MASE        ACF1
## Training set 0.2120418 15.5258 8.705402 27.66236 257.1356 0.6946145 0.003801097
checkresiduals(m2_fakir)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,0,1) with zero mean
## Q* = 29.123, df = 7, p-value = 0.0001374
## 
## Model df: 7.   Total lags used: 14

The ARIMA model largely satisfies the model assumptions. There is a slightly significant autocorrelation at lag 16, but it is not highly significant.

Finally, the decomposition and ARIMA values are combined. The fitted and real values are plotted below.

ggplot(fakir, aes(x = event_date))+
  geom_line(aes(y = Model2), color = 'red')+
  geom_line(aes(y = sold_count), color = 'black')

The model assumptions are satisfied: the residuals are distributed with mean zero and constant variance and show no significant autocorrelation, although there are some outlier observations where the errors get larger.

ggplot(fakir, aes(x = event_date))+
  geom_line(aes(y = sold_count - Model2))

acf(fakir$sold_count-fakir$Model2, na.action = na.pass)

3.5 Evaluation

Finally, the two models are tested over a period of one week. The first model, the one without external regressors, performs better as it has a lower WMAPE value.

frame_fakir
##   n     mean       sd        CV      FBias      MAPE     RMSE      MAD
## 1 7 300.1429 53.28986 0.1775483 -0.3400742 0.3762535 113.3208 102.0708
## 2 7 300.1429 53.28986 0.1775483 -0.3740729 0.4064323 120.4035 112.2753
##        MADP     WMAPE    WMAPE2
## 1 0.3400742 0.3400742 0.3400742
## 2 0.3740729 0.3740729 0.3740729

4 Altınyıldız

tsdisplay(mont_ts_week)

4.0.1 Daily Seasonality

First, the series is plotted to detect possible seasonal periods. As can be seen from the autocorrelation plot below, the autocorrelation is highest at lag 20.

acf(mont$sold_count, lag.max = 50)

The data is first decomposed at the weekly level with frequency 7; the results are displayed below. Additive decomposition is chosen since the variance of the series is not affected by the level of the time series. The variance of the random component increases around the outlier observation in the middle of the series; otherwise, the random component is stationary, with a mean close to zero and constant variance. The KPSS test also indicates that the random component is stationary.

plot(mont_ts_week_dec)

summary(ur.kpss(mont_ts_week_dec$random))
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0087 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

4.0.2 Decomposition with frequency = 20

Second, the data is decomposed with frequency 20; the results are displayed below. The random component is stationary, with a mean close to zero and constant variance, and the KPSS test also indicates stationarity.

Since the random component from the first decomposition looks more stationary and its KPSS statistic is lower, it is chosen for model construction.

plot(mont_ts_year_dec)

summary(ur.kpss(mont_ts_year_dec$random))
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0152 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

4.1 Arima Model

The ACF and PACF of the random component are plotted below.

acf(mont_random, na.action = na.pass, lag.max = 50)

pacf(mont_random, na.action = na.pass, lag.max = 50)

After trying several alternatives, the model below is chosen, with parameter q = 5. It has the lowest AIC value among the alternatives, and the ACF and PACF plots also indicate moving-average behavior consistent with this order.

summary(m1_mont)
## 
## Call:
## arima(x = mont_random, order = c(0, 0, 5), include.mean = FALSE)
## 
## Coefficients:
##           ma1      ma2      ma3     ma4     ma5
##       -0.5254  -0.6688  -0.3011  0.1166  0.3787
## s.e.   0.0478   0.0536   0.0599  0.0534  0.0432
## 
## sigma^2 estimated as 3.678:  log likelihood = -809.77,  aic = 1631.54
## 
## Training set error measures:
##                       ME    RMSE       MAE      MPE     MAPE      MASE
## Training set 0.009715837 1.91781 0.7620904 4183.206 4623.154 0.6267887
##                    ACF1
## Training set 0.01107999

The residuals of the model only partially satisfy the model assumptions. They are approximately normally distributed with zero mean and roughly constant variance, but more concentrated around zero than a normal distribution, and the autocorrelation at lag 19 is slightly significant.

checkresiduals(m1_mont)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,5) with zero mean
## Q* = 12.25, df = 9, p-value = 0.1996
## 
## Model df: 5.   Total lags used: 14

Fitted (red) and real (black) values are plotted below.

ggplot(mont, aes(x = event_date))+
  geom_line(aes(y = Model1), color = 'red')+
  geom_line(aes(y = sold_count), color = 'black')

Finally, the results from the decomposition and ARIMA model are combined. Overall, the residuals mostly satisfy the model assumptions. The variance of the series increases toward the middle of the series, and the autocorrelation is not significant.

ggplot(mont, aes(x = event_date))+
  geom_point(aes(y = sold_count - Model1))

acf(mont$sold_count-mont$Model1, na.action = na.pass)

4.2 Additional Regressors

ggpairs(mont[,c(4,18:21)])

# plot(mont$price) # ok
# plot(mont$visit_count) 
# plot(mont$basket_count) # ok
# plot(mont$favored_count)
# plot(mont$category_sold) # ok
# plot(mont$category_basket)
# plot(mont$category_visits)
# plot(mont$category_favored) # ok
# plot(mont$category_brand_sold)

Only price with lag 1, basket count with lag 2, category sold with lag 2, and category favored with lag 2 can be considered as potential regressors, since the rest of the variables are either missing or dirty.

Among these alternatives, only basket count with lag 2 is highly correlated with the sales data.

4.3 Arima with Regressors

At this stage of the analysis, the external regressor determined above is added to the model. The results are as follows.

summary(m2_mont)
## 
## Call:
## arima(x = mont_random, order = c(3, 0, 2), xreg = matrix(c(mont$basket_count_lag2), 
##     byrow = FALSE, ncol = 1))
## 
## Coefficients:
##          ar1      ar2     ar3      ma1     ma2  intercept
##       1.3572  -0.8529  0.1953  -1.8968  0.8969     -0.002
## s.e.  0.0622   0.0782  0.0573   0.0378  0.0375        NaN
##       matrix(c(mont$basket_count_lag2), byrow = FALSE, ncol = 1)
##                                                            3e-04
## s.e.                                                         NaN
## 
## sigma^2 estimated as 3.652:  log likelihood = -808.55,  aic = 1633.1
## 
## Training set error measures:
##                      ME     RMSE      MAE      MPE     MAPE      MASE
## Training set 0.01453231 1.911118 0.773554 1279.604 4015.209 0.6362171
##                     ACF1
## Training set -0.01120829
checkresiduals(m2_mont)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,0,2) with non-zero mean
## Q* = 21.6, df = 7, p-value = 0.002977
## 
## Model df: 7.   Total lags used: 14

In the analysis of the residuals, we see that their variance increases when there are extreme observations. There is a slightly significant autocorrelation at lag 16, but it is not highly significant. Also, the distribution of the residuals is more concentrated around zero than a normal distribution.

Finally, the decomposition and ARIMA values are combined. The fitted and real values are plotted below.

ggplot(mont, aes(x = event_date))+
  geom_line(aes(y = Model2), color = 'red')+
  geom_line(aes(y = sold_count), color = 'black')

In the analysis of the residuals, we see that there are some outlier observations where the errors get larger. There is also some seasonality left in the residuals, since the trend equals zero for most periods. Finally, there is no significant autocorrelation among the residuals.

ggplot(mont, aes(x = event_date))+
  geom_point(aes(y = sold_count - Model2))

acf(mont$sold_count-mont$Model2, na.action = na.pass)

4.4 Evaluation

Finally, the two models are tested over a period of one week. The first model, the one without external regressors, performs better as it has a lower WMAPE value.

frame_mont
##   n     mean       sd        CV      FBias      MAPE     RMSE      MAD
## 1 7 300.1429 53.28986 0.1775483 -0.3699630 0.4056711 120.5562 111.0417
## 2 7 300.1429 53.28986 0.1775483 -0.3818521 0.4186059 123.2360 114.6102
##        MADP     WMAPE    WMAPE2
## 1 0.3699630 0.3699630 0.3699630
## 2 0.3818521 0.3818521 0.3818521
rm(list = ls())
accu=function(actual,forecast){
  n=length(actual)
  error=actual-forecast
  mean=mean(actual)
  sd=sd(actual)
  CV=sd/mean
  FBias=sum(error)/sum(actual)
  MAPE=sum(abs(error/actual))/n
  RMSE=sqrt(sum(error^2)/n)
  MAD=sum(abs(error))/n
  MADP=sum(abs(error))/sum(abs(actual))
  WMAPE = sum((abs(error)/actual)*actual)/sum(actual)
  #WMAPE=MAD/mean
  l=data.frame(n,mean,sd,CV,FBias,MAPE,RMSE,MAD,MADP,WMAPE)
  return(l)
}

library(dplyr)   # mutate/filter/group_by/%>% used below (needed again after rm(list = ls()))
library(tsibble) # yearweek()/yearmonth() used below

updated_data <- read.csv("updated_data.csv")

5 Sleepy Bebek Islak Mendil

5.1 Task 1

bebek_mendil <- updated_data %>%
  mutate(event_date=as.Date(event_date)) %>%
  arrange(event_date) %>%
  filter(product_content_id==4066298) %>%
  as.data.table()
acf(bebek_mendil$sold_count, na.action = na.pass)

pacf(bebek_mendil$sold_count, na.action = na.pass)

There is no clear seasonality.

5.1.1 Daily decomposition

ggplot(bebek_mendil, aes(x=event_date))+
  geom_line(aes(y=sold_count))+
  labs(title="Sales of Sleepy over time")

The variance changes over time and increases slightly in the period where the series has an upward trend, so multiplicative decomposition can be used.

bebek_mendil_ts <- ts(bebek_mendil$sold_count, frequency = 7)
bebek_mendil_decomposed <- decompose(bebek_mendil_ts,type = "multiplicative")
plot(bebek_mendil_decomposed)

bebek_mendil_decomposed$random %>%
  ur.kpss() %>%
  summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.1391 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

5.1.2 Weekly decomposition

weekly_bebek_mendil <- bebek_mendil %>%
  group_by(yearweek(event_date)) %>%
  summarise(sold_count=mean(sold_count)) %>%
  rename(yearweek='yearweek(event_date)')
ggplot(weekly_bebek_mendil, aes(x=yearweek,y=sold_count ))+
  geom_line()+
  labs(title="Average weekly sales of Sleepy over time")

We can use additive decomposition, because the variance does not increase significantly as the trend increases.

acf(weekly_bebek_mendil$sold_count, na.action = na.pass)

pacf(weekly_bebek_mendil$sold_count, na.action = na.pass)

bebek_mendil_weekly_ts <- ts(weekly_bebek_mendil$sold_count, frequency = 7)
bebek_mendil_decomposed_weekly <- decompose(bebek_mendil_weekly_ts, type = "additive")
plot(bebek_mendil_decomposed_weekly)

bebek_mendil_decomposed_weekly$random %>%
  ur.kpss() %>%
  summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 3 lags. 
## 
## Value of test-statistic is: 0.0748 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

5.1.3 Monthly decomposition

monthly_bebek_mendil <- bebek_mendil %>%
  group_by(yearmonth(event_date)) %>%
  summarise(sold_count=mean(sold_count)) %>%
  rename(month='yearmonth(event_date)')
ggplot(monthly_bebek_mendil, aes(x=month,y=sold_count))+
  geom_line()+
  labs(title="Average monthly sales of Sleepy over time")

There are too few data points for a meaningful decomposition.

5.2 Task 2

random_bebek_mendil <- bebek_mendil_decomposed$random
acf(random_bebek_mendil, na.action = na.pass)

pacf(random_bebek_mendil, na.action = na.pass)

Both the ACF and the PACF are slightly sinusoidal, so we should try both AR and MA models.

arima(random_bebek_mendil, order = c(1,0,0))
## 
## Call:
## arima(x = random_bebek_mendil, order = c(1, 0, 0))
## 
## Coefficients:
##          ar1  intercept
##       0.3164     0.9611
## s.e.  0.0476     0.0254
## 
## sigma^2 estimated as 0.1198:  log likelihood = -141.88,  aic = 289.76
arima(random_bebek_mendil, order = c(2,0,0))
## 
## Call:
## arima(x = random_bebek_mendil, order = c(2, 0, 0))
## 
## Coefficients:
##          ar1      ar2  intercept
##       0.4202  -0.3266     0.9609
## s.e.  0.0474   0.0474     0.0181
## 
## sigma^2 estimated as 0.1069:  log likelihood = -119.44,  aic = 246.88
arima(random_bebek_mendil, order = c(3,0,0))
## 
## Call:
## arima(x = random_bebek_mendil, order = c(3, 0, 0))
## 
## Coefficients:
##          ar1      ar2      ar3  intercept
##       0.3463  -0.2321  -0.2240     0.9606
## s.e.  0.0489   0.0505   0.0488     0.0145
## 
## sigma^2 estimated as 0.1015:  log likelihood = -109.19,  aic = 228.39
arima(random_bebek_mendil, order = c(0,0,1))
## 
## Call:
## arima(x = random_bebek_mendil, order = c(0, 0, 1))
## 
## Coefficients:
##          ma1  intercept
##       0.3993     0.9610
## s.e.  0.0423     0.0238
## 
## sigma^2 estimated as 0.1144:  log likelihood = -132.73,  aic = 271.47
arima(random_bebek_mendil, order = c(2,0,1))
## 
## Call:
## arima(x = random_bebek_mendil, order = c(2, 0, 1))
## 
## Coefficients:
##          ar1      ar2      ma1  intercept
##       1.0116  -0.5132  -0.7295     0.9601
## s.e.  0.0571   0.0432   0.0536     0.0085
## 
## sigma^2 estimated as 0.09719:  log likelihood = -100.71,  aic = 211.42

The lowest AIC value is obtained in the model of order (2,0,1).

5.3 Task 3

ggpairs(bebek_mendil[,-c(1,2,5,7,10,12,13)])

category_sold and/or category_favored may be used as regressors.

5.4 Task 4

arima(random_bebek_mendil, xreg = bebek_mendil$category_sold, order=c(2,0,1))
## 
## Call:
## arima(x = random_bebek_mendil, order = c(2, 0, 1), xreg = bebek_mendil$category_sold)
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
##          ar1      ar2     ma1  intercept  bebek_mendil$category_sold
##       0.4693  -0.1437  0.1323     0.6613                       2e-04
## s.e.  0.1555   0.0768  0.1426     0.0360                         NaN
## 
## sigma^2 estimated as 0.07661:  log likelihood = -53.42,  aic = 118.83
auto.arima(random_bebek_mendil, xreg = bebek_mendil$category_sold)
## Series: random_bebek_mendil 
## Regression with ARIMA(3,1,1)(0,0,1)[7] errors 
## 
## Coefficients:
##          ar1      ar2     ar3      ma1    sma1   drift   xreg
##       0.5852  -0.2582  0.0443  -0.9785  0.0995  -1e-03  2e-04
## s.e.  0.0738   0.0568  0.0617   0.0166  0.0519   3e-04  0e+00
## 
## sigma^2 estimated as 0.07515:  log likelihood=-47.03
## AIC=110.05   AICc=110.43   BIC=141.88
auto.arima(random_bebek_mendil, xreg = bebek_mendil$category_favored)
## Series: random_bebek_mendil 
## Regression with ARIMA(2,1,1)(1,0,0)[7] errors 
## 
## Coefficients:
##          ar1      ar2      ma1    sar1   drift  xreg
##       0.4699  -0.2304  -0.9619  0.1001  -4e-04     0
## s.e.  0.0560   0.0554   0.0219  0.0522   2e-04     0
## 
## sigma^2 estimated as 0.0903:  log likelihood=-83.65
## AIC=181.3   AICc=181.59   BIC=209.15
auto.arima(random_bebek_mendil, xreg = cbind(bebek_mendil$category_favored,bebek_mendil$category_sold))
## Series: random_bebek_mendil 
## Regression with ARIMA(3,1,1)(0,0,1)[7] errors 
## 
## Coefficients:
##          ar1      ar2     ar3      ma1    sma1  xreg1  xreg2
##       0.6001  -0.2531  0.0495  -0.9715  0.0975      0  2e-04
## s.e.  0.0358   0.0579  0.0486   0.0151  0.0517      0  1e-04
## 
## sigma^2 estimated as 0.07553:  log likelihood=-47.85
## AIC=111.7   AICc=112.08   BIC=143.53

The model with the lowest AIC is the one of order (3,1,1) with seasonal order (0,0,1) and external regressor category_sold. The external regressor also has a significant impact on the AIC value: the AIC of the model of order (2,0,1) decreases from 211.42 to 118.83 when it is added.

test_dates <- seq(as.Date("2021-06-24"), as.Date("2021-06-30"), by = "day")

for(i in 1:length(test_dates)){
  
  current_date=test_dates[i]-2
  
  past_data <- bebek_mendil[event_date<=current_date,]
  
  bebek_mendil_ts <- ts(past_data$sold_count, frequency = 7)
  bebek_mendil_decomposed <- decompose(bebek_mendil_ts,type = "multiplicative")
  model <- arima(bebek_mendil_decomposed$random,order = c(3,1,1),seasonal = c(0,0,1), xreg=past_data$category_sold)
  forecasted=predict(model,n.ahead = 2,newxreg = bebek_mendil[event_date==test_dates[i],"category_sold"])
  bebek_mendil[nrow(bebek_mendil)-length(test_dates)+i,Model1:=forecasted$pred[2]*bebek_mendil_decomposed$seasonal[(nrow(bebek_mendil)-length(test_dates)+i)%%7+7]*bebek_mendil_decomposed$trend[max(which(!is.na(bebek_mendil_decomposed$trend)))]]
}

m<-accu(bebek_mendil$sold_count[(nrow(bebek_mendil)+1-length(test_dates)):(nrow(bebek_mendil))],bebek_mendil$Model1[(nrow(bebek_mendil)+1-length(test_dates)):(nrow(bebek_mendil))])
m
##   n     mean       sd        CV      FBias      MAPE     RMSE      MAD
## 1 7 525.7143 140.9902 0.2681879 -0.2874911 0.3959773 248.2186 164.2288
##        MADP     WMAPE
## 1 0.3123917 0.3123917
ggplot(bebek_mendil,aes(x=event_date))+
  geom_line(aes(y=Model1), color="red")+
  xlim(as.Date("2021-06-24"),as.Date("2021-06-30"))+ # restrict the plot to the test window
  geom_line(aes(y=sold_count))

6 Oral B - Sarj Edilebilir Dis Fircasi (Rechargeable Toothbrush)

6.1 Task 1

dis_fircasi <- updated_data %>%
  mutate(event_date=as.Date(event_date)) %>%
  arrange(event_date) %>%
  filter(product_content_id==32939029) %>%
  as.data.table()
acf(dis_fircasi$sold_count, na.action = na.pass)

pacf(dis_fircasi$sold_count, na.action = na.pass)

We see relatively high autocorrelation at lags 6 and 7. Although the autocorrelation at lag 6 is slightly higher, I will set frequency=7, because weekly seasonality is more meaningful.
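The choice of period can also be cross-checked with `forecast::findfrequency()`, which estimates the dominant period from the spectral density. A sketch, assuming `dis_fircasi` is loaded as above:

```r
library(forecast)

# Estimate the dominant seasonal period of the sales series;
# a value near 7 would support the weekly-seasonality assumption.
findfrequency(na.omit(dis_fircasi$sold_count))
```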

6.1.1 Daily decomposition

ggplot(dis_fircasi, aes(x=event_date))+
  geom_line(aes(y=sold_count))+
  labs(title="Sales of Oral-B over time")

The variance changes over time and increases slightly in the period where the series has an upward trend. That is why I decided to use multiplicative decomposition, although additive decomposition is also a valid option.
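One rough heuristic for choosing between the two types is to decompose both ways and compare the spread of the random components on a comparable scale. A sketch (this is a heuristic, not a formal test; `ts7` is a temporary name):

```r
# Fit both decomposition types; the multiplicative random component is
# centred around 1, so the additive one is scaled by the series mean
# before comparing standard deviations.
ts7  <- ts(dis_fircasi$sold_count, frequency = 7)
mult <- decompose(ts7, type = "multiplicative")
addi <- decompose(ts7, type = "additive")
sd(mult$random, na.rm = TRUE)
sd(addi$random / mean(ts7, na.rm = TRUE), na.rm = TRUE)
```

The type with the smaller relative residual spread arguably removed trend and seasonality better.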

dis_fircasi_ts <- ts(dis_fircasi$sold_count, frequency = 7)
dis_fircasi_decomposed <- decompose(dis_fircasi_ts,type = "multiplicative")
plot(dis_fircasi_decomposed)

dis_fircasi_decomposed$random %>%
  ur.kpss() %>%
  summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0795 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

6.1.2 Weekly decomposition

weekly_dis_fircasi <- dis_fircasi %>%
  group_by(yearweek(event_date)) %>%
  summarise(sold_count=mean(sold_count)) %>%
  rename(yearweek='yearweek(event_date)')
ggplot(weekly_dis_fircasi, aes(x=yearweek,y=sold_count ))+
  geom_line()+
  labs(title="Average weekly sales of Oral-B over time")

The variance increases as the trend increases, so a multiplicative model can be used.

acf(weekly_dis_fircasi$sold_count, na.action = na.pass)

pacf(weekly_dis_fircasi$sold_count, na.action = na.pass)

dis_fircasi_weekly_ts <- ts(weekly_dis_fircasi$sold_count, frequency = 7)
dis_fircasi_decomposed_weekly <- decompose(dis_fircasi_weekly_ts, type = "multiplicative")
plot(dis_fircasi_decomposed_weekly)

6.1.3 Monthly decomposition

monthly_dis_fircasi <- dis_fircasi %>%
  group_by(yearmonth(event_date)) %>%
  summarise(sold_count=mean(sold_count)) %>%
  rename(month='yearmonth(event_date)')
ggplot(monthly_dis_fircasi, aes(x=month,y=sold_count))+
  geom_line()+
  labs(title="Average monthly sales of Oral-B over time")

There are too few data points for a meaningful decomposition.

6.2 Task 2

dis_fircasi_random <- dis_fircasi_decomposed$random
acf(dis_fircasi_random, na.action = na.pass)

pacf(dis_fircasi_random, na.action = na.pass)

The ACF is slightly sinusoidal and the PACF is messy, so we can try both autoregressive and moving-average models.

arima(dis_fircasi_random, order=c(1,0,0))
## 
## Call:
## arima(x = dis_fircasi_random, order = c(1, 0, 0))
## 
## Coefficients:
##          ar1  intercept
##       0.2547     0.9587
## s.e.  0.0492     0.0269
## 
## sigma^2 estimated as 0.1558:  log likelihood = -189.45,  aic = 384.91
arima(dis_fircasi_random, order=c(2,0,0))
## 
## Call:
## arima(x = dis_fircasi_random, order = c(2, 0, 0))
## 
## Coefficients:
##          ar1      ar2  intercept
##       0.3261  -0.2794     0.9586
## s.e.  0.0489   0.0488     0.0202
## 
## sigma^2 estimated as 0.1436:  log likelihood = -173.74,  aic = 355.48
arima(dis_fircasi_random, order=c(1,0,1))
## 
## Call:
## arima(x = dis_fircasi_random, order = c(1, 0, 1))
## 
## Coefficients:
##           ar1     ma1  intercept
##       -0.0232  0.3388     0.9587
## s.e.   0.1062  0.0926     0.0259
## 
## sigma^2 estimated as 0.1515:  log likelihood = -184.07,  aic = 376.14
arima(dis_fircasi_random, order=c(2,0,1))
## 
## Call:
## arima(x = dis_fircasi_random, order = c(2, 0, 1))
## 
## Coefficients:
##          ar1      ar2      ma1  intercept
##       0.9412  -0.4583  -0.7306     0.9593
## s.e.  0.0605   0.0457   0.0540     0.0096
## 
## sigma^2 estimated as 0.1299:  log likelihood = -154.54,  aic = 319.08
arima(dis_fircasi_random, order=c(3,0,2))
## 
## Call:
## arima(x = dis_fircasi_random, order = c(3, 0, 2))
## 
## Coefficients:
##          ar1     ar2      ar3     ma1      ma2  intercept
##       0.0225  0.3997  -0.4647  0.1809  -0.6233     0.9593
## s.e.  0.0883  0.0849   0.0481  0.0916   0.0839     0.0098
## 
## sigma^2 estimated as 0.1276:  log likelihood = -151.17,  aic = 316.34

The ARIMA model of order (3,0,2) performed best; it has the lowest AIC value among these models.
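Before moving on, it is worth checking that the residuals of the chosen model behave like white noise. A sketch using the Ljung-Box test (`fitdf = 5` accounts for the p + q = 5 estimated ARMA coefficients):

```r
fit <- arima(dis_fircasi_random, order = c(3, 0, 2))
# A large p-value means no significant remaining autocorrelation,
# i.e. the model has captured the serial structure of the residuals.
Box.test(residuals(fit), lag = 10, fitdf = 5, type = "Ljung-Box")
```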

6.3 Task 3

ggpairs(dis_fircasi[,-c(1,2,5,7,10,12,13)])

It is not surprising that basket_count is the variable most correlated with sold_count. But I will use category_favored as the regressor, because I believe it is a more stable variable.
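The ranking behind this choice can be made explicit by computing the correlations directly (the column names below are those listed in the data's column specification):

```r
# Rank candidate regressors by their correlation with sold_count;
# use = "complete.obs" skips rows with missing values.
vars <- c("basket_count", "favored_count", "category_sold",
          "category_favored", "category_visits")
sort(sapply(vars, function(v)
  cor(dis_fircasi$sold_count, dis_fircasi[[v]], use = "complete.obs")),
  decreasing = TRUE)
```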

6.4 Task 4

arima(dis_fircasi_random, order=c(3,0,2), xreg = dis_fircasi$category_favored)
## 
## Call:
## arima(x = dis_fircasi_random, order = c(3, 0, 2), xreg = dis_fircasi$category_favored)
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
##           ar1     ar2      ar3     ma1      ma2  intercept
##       -0.0385  0.3515  -0.4465  0.2391  -0.5591     0.8975
## s.e.   0.0903  0.0882   0.0470  0.0932   0.0817     0.0100
##       dis_fircasi$category_favored
##                                  0
## s.e.                           NaN
## 
## sigma^2 estimated as 0.125:  log likelihood = -147.01,  aic = 310.01

The model improved only slightly when the regressor was added.

auto.arima(dis_fircasi_random, xreg = dis_fircasi$category_favored)
## Series: dis_fircasi_random 
## Regression with ARIMA(0,0,1)(1,0,0)[7] errors 
## 
## Coefficients:
##          ma1    sar1  intercept   xreg
##       0.2909  0.0014     0.8287  0e+00
## s.e.  0.0557  0.0643     0.0575  1e-04
## 
## sigma^2 estimated as 0.1454:  log likelihood=-174.03
## AIC=358.06   AICc=358.22   BIC=377.86

auto.arima does not work well in this case.
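One likely reason is that auto.arima's default stepwise search only explores a greedy path through the order space. Disabling the shortcuts makes it search exhaustively with exact likelihoods (slower, but more thorough); a sketch:

```r
library(forecast)

# stepwise = FALSE searches all candidate orders instead of a greedy path;
# approximation = FALSE uses exact likelihoods when comparing models.
auto.arima(dis_fircasi_random, xreg = dis_fircasi$category_favored,
           stepwise = FALSE, approximation = FALSE)
```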

test_dates <- seq(as.Date("2021-06-24"), as.Date("2021-06-30"), by = "day") # seq() keeps the Date class

for(i in 1:length(test_dates)){
  
  current_date=test_dates[i]-2
  
  past_data <- dis_fircasi[event_date<=current_date,]
  
  dis_fircasi_ts <- ts(past_data$sold_count, frequency = 7)
  dis_fircasi_decomposed <- decompose(dis_fircasi_ts,type = "multiplicative")
  model <- arima(dis_fircasi_decomposed$random,order = c(3,0,2), xreg=past_data$category_favored)
  forecasted=predict(model,n.ahead = 2,newxreg = dis_fircasi[event_date==test_dates[i],"category_favored"])
  dis_fircasi[nrow(dis_fircasi)-length(test_dates)+i,Model1:=forecasted$pred[2]*dis_fircasi_decomposed$seasonal[(nrow(dis_fircasi)-length(test_dates)+i)%%7+7]*dis_fircasi_decomposed$trend[max(which(!is.na(dis_fircasi_decomposed$trend)))]]
}

s<-accu(dis_fircasi$sold_count[(nrow(dis_fircasi)+1-length(test_dates)):(nrow(dis_fircasi))],dis_fircasi$Model1[(nrow(dis_fircasi)+1-length(test_dates)):(nrow(dis_fircasi))])
s
##   n     mean       sd        CV       FBias      MAPE     RMSE      MAD
## 1 7 72.85714 42.85024 0.5881405 -0.08780776 0.7994462 54.80752 49.75189
##        MADP     WMAPE
## 1 0.6828691 0.6828691
ggplot(dis_fircasi,aes(x=event_date))+
  geom_line(aes(y=Model1), color="red")+
  xlim(as.Date("2021-06-24"),as.Date("2021-06-30"))+ # restrict the plot to the test window
  geom_line(aes(y=sold_count))

7 La Roche Posay Temizleme Jeli (Cleansing Gel)

7.1 Task 1

yuz_temizleyici <- updated_data %>%
  mutate(event_date=as.Date(event_date)) %>%
  arrange(event_date) %>%
  filter(product_content_id==85004) %>%
  as.data.table()
acf(yuz_temizleyici$sold_count, na.action = na.pass)

pacf(yuz_temizleyici$sold_count, na.action = na.pass)

Interestingly, we see a relatively high autocorrelation in lag 15.

7.1.1 Daily decomposition

The data show only weak seasonality, but I set frequency=15 because the autocorrelation is relatively higher at lag 15.

ggplot(yuz_temizleyici, aes(x=event_date))+
  geom_line(aes(y=sold_count))+
  labs(title="Sales of La Roche over time")

We see a slightly increasing trend, and the variance is higher than in earlier periods, except for the peaks in November 2020 and January 2021. Multiplicative decomposition is therefore more suitable for this series.

yuz_temizleyici_ts <- ts(yuz_temizleyici$sold_count, frequency = 15)
yuz_temizleyici_decomposed <- decompose(yuz_temizleyici_ts,type = "multiplicative")
plot(yuz_temizleyici_decomposed)

7.1.2 Weekly decomposition

weekly_yuz_temizleyici <- yuz_temizleyici %>%
  group_by(yearweek(event_date)) %>%
  summarise(sold_count=mean(sold_count)) %>%
  rename(yearweek='yearweek(event_date)')
ggplot(weekly_yuz_temizleyici, aes(x=yearweek,y=sold_count ))+
  geom_line()+
  labs(title="Average weekly sales of La Roche Posay over time")

The variance increases only slightly with the trend, so additive decomposition may be used.

acf(weekly_yuz_temizleyici$sold_count, na.action = na.pass)

pacf(weekly_yuz_temizleyici$sold_count, na.action = na.pass)

We should try ARMA models.

yuz_temizleyici_weekly_ts <- ts(weekly_yuz_temizleyici$sold_count, frequency = 9)
yuz_temizleyici_decomposed_weekly <- decompose(yuz_temizleyici_weekly_ts, type = "additive")
plot(yuz_temizleyici_decomposed_weekly)

7.1.3 Monthly decomposition

monthly_yuz_temizleyici <- yuz_temizleyici %>%
  group_by(yearmonth(event_date)) %>%
  summarise(sold_count=mean(sold_count)) %>%
  rename(month='yearmonth(event_date)')
ggplot(monthly_yuz_temizleyici, aes(x=month,y=sold_count))+
  geom_line()+
  labs(title="Average monthly sales of La Roche Posay over time")

There are too few data points for a meaningful decomposition.

7.2 Task 2

yuz_temizleyici_random <- yuz_temizleyici_decomposed$random
arima(yuz_temizleyici_random, order=c(1,0,0))
## 
## Call:
## arima(x = yuz_temizleyici_random, order = c(1, 0, 0))
## 
## Coefficients:
##          ar1  intercept
##       0.5466     0.9896
## s.e.  0.0424     0.0476
## 
## sigma^2 estimated as 0.1819:  log likelihood = -220.1,  aic = 446.21
arima(yuz_temizleyici_random, order=c(0,0,1))
## 
## Call:
## arima(x = yuz_temizleyici_random, order = c(0, 0, 1))
## 
## Coefficients:
##          ma1  intercept
##       0.4327     0.9901
## s.e.  0.0363     0.0324
## 
## sigma^2 estimated as 0.1988:  log likelihood = -237.25,  aic = 480.51
arima(yuz_temizleyici_random, order=c(2,0,0))
## 
## Call:
## arima(x = yuz_temizleyici_random, order = c(2, 0, 0))
## 
## Coefficients:
##          ar1      ar2  intercept
##       0.5880  -0.0756     0.9899
## s.e.  0.0505   0.0505     0.0442
## 
## sigma^2 estimated as 0.1809:  log likelihood = -218.99,  aic = 445.98
arima(yuz_temizleyici_random, order=c(1,0,2))
## 
## Call:
## arima(x = yuz_temizleyici_random, order = c(1, 0, 2))
## 
## Coefficients:
##          ar1     ma1     ma2  intercept
##       0.2936  0.2565  0.3036     0.9896
## s.e.  0.0905  0.0847  0.0598     0.0465
## 
## sigma^2 estimated as 0.1725:  log likelihood = -209.85,  aic = 429.7
auto.arima(yuz_temizleyici_random)
## Series: yuz_temizleyici_random 
## ARIMA(2,0,3) with non-zero mean 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2      ma3    mean
##       1.4414  -0.6675  -1.0483  0.3036  -0.1989  0.9882
## s.e.  0.0758   0.0672   0.0857  0.0765   0.0722  0.0050
## 
## sigma^2 estimated as 0.1443:  log likelihood=-172.91
## AIC=359.82   AICc=360.12   BIC=387.55

The model of order (2,0,3) performed best, with the lowest AIC.

7.3 Task 3

ggpairs(yuz_temizleyici[,-c(1,2,5,7,10,12,13)])

Naturally, basket_count is the variable most correlated with sold_count. But for forecasting, I believe category_favored is a better regressor: I expect its variability to be smaller, since it depends on interest in a whole category of products rather than a specific product.

7.4 Task 4

arima(yuz_temizleyici_random, order=c(2,0,3), xreg = yuz_temizleyici$category_favored)
## 
## Call:
## arima(x = yuz_temizleyici_random, order = c(2, 0, 3), xreg = yuz_temizleyici$category_favored)
## 
## Coefficients:
##          ar1      ar2      ma1     ma2      ma3  intercept
##       1.3827  -0.3877  -0.8900  0.0884  -0.1544     0.1982
## s.e.  0.1541   0.1526   0.1505  0.0998   0.0763     0.0361
##       yuz_temizleyici$category_favored
##                                      0
## s.e.                               NaN
## 
## sigma^2 estimated as 0.1161:  log likelihood = -133.26,  aic = 282.51

Adding external regressors decreased AIC significantly.

auto.arima(yuz_temizleyici_random, xreg = yuz_temizleyici$category_favored)
## Series: yuz_temizleyici_random 
## Regression with ARIMA(0,1,1)(0,0,1)[15] errors 
## 
## Coefficients:
##           ma1    sma1   drift  xreg
##       -0.4189  0.0700  -1e-04     0
## s.e.   0.0599  0.0507   0e+00     0
## 
## sigma^2 estimated as 0.1341:  log likelihood=-158.41
## AIC=326.82   AICc=326.98   BIC=346.61

auto.arima does not work well in this case.

We selected our test period as “2021-06-24”-“2021-06-30”.

test_dates <- seq(as.Date("2021-06-24"), as.Date("2021-06-30"), by = "day") # seq() keeps the Date class

for(i in 1:length(test_dates)){
  
  current_date=test_dates[i]-2
  
  past_data <- yuz_temizleyici[event_date<=current_date,]
  
  yuz_temizleyici_ts <- ts(past_data$sold_count, frequency = 15)
  yuz_temizleyici_decomposed <- decompose(yuz_temizleyici_ts,type = "multiplicative")
  model <- arima(yuz_temizleyici_decomposed$random,order = c(2,0,3), xreg=past_data$category_favored)
  forecasted=predict(model,n.ahead = 2,newxreg = yuz_temizleyici[event_date==test_dates[i],"category_favored"])
  yuz_temizleyici[nrow(yuz_temizleyici)-length(test_dates)+i,Model1:=forecasted$pred[2]*yuz_temizleyici_decomposed$seasonal[(nrow(yuz_temizleyici)-length(test_dates)+i)%%15+15]*yuz_temizleyici_decomposed$trend[max(which(!is.na(yuz_temizleyici_decomposed$trend)))]]
}

t<-accu(yuz_temizleyici$sold_count[(nrow(yuz_temizleyici)+1-length(test_dates)):(nrow(yuz_temizleyici))],yuz_temizleyici$Model1[(nrow(yuz_temizleyici)+1-length(test_dates)):(nrow(yuz_temizleyici))])
t
##   n     mean      sd        CV    FBias      MAPE     RMSE      MAD      MADP
## 1 7 71.28571 26.4305 0.3707686 0.233834 0.1771441 27.04648 16.67718 0.2339485
##       WMAPE
## 1 0.2339485
ggplot(yuz_temizleyici,aes(x=event_date))+
  geom_line(aes(y=Model1), color="red")+
  xlim(as.Date("2021-06-24"),as.Date("2021-06-30"))+ # restrict the plot to the test window
  geom_line(aes(y=sold_count))

## Parsed with column specification:
## cols(
##   event_date = col_date(format = ""),
##   product_content_id = col_double(),
##   price = col_double(),
##   sold_count = col_double(),
##   visit_count = col_double(),
##   basket_count = col_double(),
##   favored_count = col_double(),
##   category_sold = col_double(),
##   category_visits = col_double(),
##   category_basket = col_double(),
##   category_favored = col_double(),
##   category_brand_sold = col_double(),
##   ty_visits = col_double()
## )
raw_data <- data.table(raw_data)
raw_data[, "event_date" := as.Date(event_date)]
raw_data <- raw_data[event_date >= '2020-05-25']
raw_data[, trend := 1:.N]
raw_data[, month := month(event_date, label = TRUE)]
raw_data[, wday := wday(event_date, label = TRUE)]
raw_data <- raw_data[order(event_date),]

8 Leopard Skin Bikini

The next product is the leopard skin bikini. To analyze it, we first subset its data.

data_leopar <- raw_data[product_content_id == 73318567] #leopar

To understand the trend, the time series graph of the number of sold items should be examined.

ggplot(data_leopar, aes(x=event_date)) + 
  geom_line(aes(y = sold_count), color = "red")  + ggtitle("Trend of Leopard Skin Bikini Sales") + xlab("Date") + ylab("Sales")

There are many missing points in the data. Since we do not have a full year of data, we cannot check for a monthly trend. We will check for a daily trend and use the autocorrelation plot to decide the lag.

acf(data_leopar$sold_count, main= "Daily Autocorrelation")

pacf(data_leopar$sold_count, main= "Daily Partial Autocorrelation") 

The plots suggest a trend, but no specific lag stands out. The autocorrelation plot above includes the N/A values; for a better understanding, we should also check the data after removing the N/A's.

Data with clearing:

leopar <- na.omit(data_leopar)
ggplot(leopar, aes(x=event_date)) + 
  geom_line(aes(y = sold_count), color = "red")  + ggtitle("Trend of Leopard Skin Bikini Sales N/A Omitted") + xlab("Date") + ylab("Sales")

Checking the autocorrelation and partial autocorrelation:

acf(leopar$sold_count, main= "Daily Autocorrelation")

pacf(leopar$sold_count, main= "Daily Partial Autocorrelation")

The autocorrelation plot yields very similar results to the N/A-included version.

Although there is no clear peak at lag seven, weekly seasonality is plausible, so we decompose the daily series with frequency 7. Since the variance changes over time, we first try a multiplicative decomposition and, depending on the results, an additive one.

To be able to track the days of the week, we will use the N/A-included version in order not to lose any data points.

leoparts <- ts(data_leopar$sold_count,freq=7)
data_mult<-decompose(leoparts,type="multiplicative")
random=data_mult$random
plot(data_mult)

The trend increases as summer arrives, though there is a sharp decrease in June due to the stock-out of smaller sizes. The random component looks fairly stationary, with a nearly stable mean and variance apart from two outlier peaks.

In this graph it is hard to identify seasonal daily trend, so we need to have a closer look.

plot(data_mult$seasonal[1:7], xlab = "Day", ylab = "Daily Multiplier", main = "Seasonal Component for the Multiplicative Model")

mean_sold=data_leopar[,list(m_sold = mean(sold_count,na.rm=T)), by="wday"]
mean_sold
##    wday   m_sold
## 1:  Mon 16.10345
## 2:  Tue 17.22414
## 3:  Wed 18.56897
## 4:  Thu 18.41379
## 5:  Fri 17.22807
## 6:  Sat 20.78947
## 7:  Sun 20.50877

To interpret the multiplicative decomposition, we examine the daily multipliers and compare them with the mean sales of each day. Saturdays are the maximum, and the seasonal component reflects this. However, although the mean is also very high on Sundays, the Sunday multiplier is low.

Higher sales on weekends are expected, as people have more time to shop.

To check the stationarity of the detrended part, we use the KPSS test.

unt_test=ur.kpss(data_mult$random) 
summary(unt_test)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 4 lags. 
## 
## Value of test-statistic is: 0.0863 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The series is stationary: the test statistic of 0.0863 is below even the 10% critical value of 0.347, meaning the p-value is larger than 10%, so we fail to reject the null hypothesis that the series is stationary. (After differencing, the statistic of 0.0519 is also below 0.347.)
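The comparison can also be done programmatically: `ur.kpss` returns an S4 object whose slots hold the statistic and the critical values. A sketch:

```r
library(urca)

kpss <- ur.kpss(data_mult$random)
# @teststat is the KPSS statistic; @cval holds the 10/5/2.5/1% critical values.
kpss@teststat < kpss@cval[1, "10pct"]  # TRUE => fail to reject stationarity
```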

The multipliers of the multiplicative model were not satisfying, so we also check the additive decomposition.

leoparts <- ts(data_leopar$sold_count,freq=7)
data_add<-decompose(leoparts,type="additive")
random=data_add$random
plot(data_add)

The trend is very similar to the multiplicative version. In the detrended part, the outliers seem to have disappeared; the remaining component looks more stationary in the additive model.

In this graph it is hard to identify seasonal daily trend, so we need to have a closer look.

plot(data_add$seasonal[1:7], xlab = "Day", ylab = "Daily Additive Effect", main = "Seasonal Component for the Additive Model")

mean_sold=data_leopar[,list(m_sold = mean(sold_count,na.rm=T)), by="wday"]
mean_sold
##    wday   m_sold
## 1:  Mon 16.10345
## 2:  Tue 17.22414
## 3:  Wed 18.56897
## 4:  Thu 18.41379
## 5:  Fri 17.22807
## 6:  Sat 20.78947
## 7:  Sun 20.50877

In the additive model, the daily seasonal effects closely track the mean sales of each day; the high weekend sales correspond to high seasonal effects.

To check the stationarity of the detrended part, we use the KPSS test again.

unt_test=ur.kpss(data_add$random) 
summary(unt_test)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0074 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The series is stationary: the test statistic of 0.0074 is below even the 10% critical value of 0.347, so we fail to reject the null hypothesis of stationarity. (After differencing, the statistic of 0.0519 is also below 0.347.) This statistic is also smaller than for the multiplicative decomposition, so I will continue the analysis with the additive model.

acf(random, na.action = na.pass, main= "Detrend's Autocorrelation")

pacf(random, na.action = na.pass, main= "Detrend's Partial Autocorrelation")

The autocorrelation plot decays, and there is a significant partial autocorrelation at lag 5. That is why AR(5) will be tried first.

model <- arima(random, order=c(5,0,0))
print(model)
## 
## Call:
## arima(x = random, order = c(5, 0, 0))
## 
## Coefficients:
##          ar1      ar2      ar3      ar4      ar5  intercept
##       0.0361  -0.2422  -0.2045  -0.1282  -0.2927     0.0147
## s.e.  0.0479   0.0475   0.0479   0.0474   0.0478     0.2627
## 
## sigma^2 estimated as 91.12:  log likelihood = -1459.34,  aic = 2932.67

The partial autocorrelation plot also seems to decay, and there is a significant autocorrelation at lag 5 in the autocorrelation plot. Let's also try MA(5).

model <- arima(random, order=c(0,0,5))
print(model)
## 
## Call:
## arima(x = random, order = c(0, 0, 5))
## 
## Coefficients:
##           ma1      ma2      ma3      ma4      ma5  intercept
##       -0.0828  -0.3490  -0.2766  -0.1211  -0.1705     0.0015
## s.e.   0.0491   0.0545   0.0567   0.0481   0.0603     0.0117
## 
## sigma^2 estimated as 83.58:  log likelihood = -1444.24,  aic = 2902.47

Since the AIC is lower than that of AR(5), this model is better.

Next we try their combination, ARMA(5,5), and its neighbours to decide which model to use.
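Instead of fitting neighbours one by one, a small grid search can collect the AICs in one pass. A sketch over the orders considered here (`tryCatch` guards against orders that fail to converge):

```r
# Grid search over ARMA(p, q) orders around (5,5); pick the minimum AIC.
orders <- expand.grid(p = 4:5, q = 4:5)
orders$aic <- apply(orders, 1, function(o)
  tryCatch(AIC(arima(random, order = c(o["p"], 0, o["q"]))),
           error = function(e) NA))
orders[order(orders$aic), ]
```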

model <- arima(random, order=c(5,0,5))
print(model)
## 
## Call:
## arima(x = random, order = c(5, 0, 5))
## 
## Coefficients:
##          ar1      ar2      ar3     ar4      ar5      ma1     ma2     ma3
##       0.1543  -0.3693  -0.4067  0.5051  -0.3374  -0.2525  0.0365  0.1156
## s.e.  0.1350   0.1064   0.0822  0.1241   0.0929   0.1424  0.1259  0.0791
##           ma4      ma5  intercept
##       -0.8388  -0.0609     0.0017
## s.e.   0.1260   0.1397     0.0093
## 
## sigma^2 estimated as 79.22:  log likelihood = -1434.22,  aic = 2892.43

This is the lowest AIC so far, thus the best model. Trying the neighbours:

modelf <- arima(random, order=c(5,0,4))
print(modelf)

Moving to MA(4) improved the performance. Checking AR(4) as well:

model <- arima(random, order=c(4,0,4))
print(model)
## 
## Call:
## arima(x = random, order = c(4, 0, 4))
## 
## Coefficients:
##           ar1      ar2     ar3      ar4     ma1      ma2      ma3     ma4
##       -0.0991  -0.0160  0.5042  -0.4244  0.0115  -0.3685  -0.9035  0.2605
## s.e.   0.1563   0.0888  0.0489   0.0891  0.1674   0.0512   0.0583  0.1523
##       intercept
##          0.0015
## s.e.     0.0091
## 
## sigma^2 estimated as 81.91:  log likelihood = -1440.96,  aic = 2901.91

This lowered the performance, so we choose order (5,0,4) for the ARIMA models.

To identify possible regressors for the ARIMA model, the correlations with both the detrended data and sold_count are important.

Then, lagged versions of the variables will be added to the model, since at prediction time we need regressor values that are already known.

leopar <- data_leopar
leopar <- leopar[, random := data_add$random]
numeric_data <- leopar
numeric_data$event_date = NULL
numeric_data$product_content_id = NULL
numeric_data$trend= NULL
numeric_data$month= NULL
numeric_data$wday= NULL
numeric_data$random= as.numeric(numeric_data$random)
numeric_data <- na.omit(numeric_data)
str(numeric_data)
## Classes 'data.table' and 'data.frame':   119 obs. of  12 variables:
##  $ price              : num  60 60 60 60 60 ...
##  $ sold_count         : num  1 2 2 2 1 1 5 3 6 1 ...
##  $ visit_count        : num  0 0 0 0 0 261 306 343 334 278 ...
##  $ basket_count       : num  4 11 13 7 2 7 25 28 35 17 ...
##  $ favored_count      : num  73 62 110 45 37 49 71 76 45 51 ...
##  $ category_sold      : num  124 156 219 148 134 175 457 372 360 229 ...
##  $ category_visits    : num  389 495 453 391 352 458 781 608 588 350 ...
##  $ category_basket    : num  0 0 0 0 0 ...
##  $ category_favored   : num  3246 3956 3427 2601 2167 ...
##  $ category_brand_sold: num  14526 13797 20377 8414 7112 ...
##  $ ty_visits          : num  1 1 1 1 1 ...
##  $ random             : num  -1.9975 -0.9524 3.0626 0.6528 0.0451 ...
##  - attr(*, ".internal.selfref")=<externalptr>
correl_info = cor(numeric_data)
ggcorrplot(correl_info, hc.order = TRUE, type = "lower",lab = TRUE)

The variables most correlated with the data are basket_count, visit_count, and favored_count, so we add their lagged versions and examine their correlations as well. Price does not show a strong correlation, since the data do not reflect every discount.

leopar <- leopar[,favored_count_lag := shift(favored_count, 2)]
leopar <- leopar[,basket_count_lag := shift(basket_count, 2)]
leopar <- leopar[,visit_count_lag := shift(visit_count, 2)]

numeric_data <- leopar
numeric_data$event_date = NULL
numeric_data$product_content_id = NULL
numeric_data$trend= NULL
numeric_data$month= NULL
numeric_data$wday= NULL
numeric_data$random= as.numeric(numeric_data$random)
numeric_data <- na.omit(numeric_data)
str(numeric_data)
## Classes 'data.table' and 'data.frame':   119 obs. of  15 variables:
##  $ price              : num  60 60 60 60 60 ...
##  $ sold_count         : num  1 2 2 2 1 1 5 3 6 1 ...
##  $ visit_count        : num  0 0 0 0 0 261 306 343 334 278 ...
##  $ basket_count       : num  4 11 13 7 2 7 25 28 35 17 ...
##  $ favored_count      : num  73 62 110 45 37 49 71 76 45 51 ...
##  $ category_sold      : num  124 156 219 148 134 175 457 372 360 229 ...
##  $ category_visits    : num  389 495 453 391 352 458 781 608 588 350 ...
##  $ category_basket    : num  0 0 0 0 0 ...
##  $ category_favored   : num  3246 3956 3427 2601 2167 ...
##  $ category_brand_sold: num  14526 13797 20377 8414 7112 ...
##  $ ty_visits          : num  1 1 1 1 1 ...
##  $ random             : num  -1.9975 -0.9524 3.0626 0.6528 0.0451 ...
##  $ favored_count_lag  : num  0 37 73 110 83 37 45 33 71 76 ...
##  $ basket_count_lag   : num  0 4 4 13 11 2 12 5 25 28 ...
##  $ visit_count_lag    : num  0 0 0 0 0 0 144 84 306 343 ...
##  - attr(*, ".internal.selfref")=<externalptr>
correl_info = cor(numeric_data)
ggcorrplot(correl_info, hc.order = TRUE, type = "lower",lab = TRUE)

basket_count_lag has the highest correlation, so we try it first.

reg_matrix = cbind(leopar$basket_count_lag) # can add more if any other regressors exist
model1 <- arima(leopar$random, order = c(5,0,4), xreg = reg_matrix)
summary(model1)
## 
## Call:
## arima(x = leopar$random, order = c(5, 0, 4), xreg = reg_matrix)
## 
## Coefficients:
##          ar1      ar2      ar3    ar4      ar5      ma1     ma2     ma3
##       0.1882  -0.3865  -0.4217  0.524  -0.3827  -0.3052  0.0533  0.1220
## s.e.  0.0628      NaN   0.0626    NaN   0.0465   0.0386     NaN  0.0575
##           ma4  intercept  reg_matrix
##       -0.8702    -0.0203       4e-04
## s.e.      NaN     0.0117       2e-04
## 
## sigma^2 estimated as 77.77:  log likelihood = -1430.69,  aic = 2885.38
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE    MAPE      MASE
## Training set 0.1958726 8.818632 3.977402 -17.58764 268.857 0.6383409
##                      ACF1
## Training set 0.0008931498

Adding the variable lowered the AIC, so the model improved. To compare the two models visually, let's plot the predicted values against the actual data.

model_fitted <- leopar$random - residuals(model1)
leopar <- cbind(leopar, data_add$seasonal, data_add$trend, model_fitted)
leopar <-leopar[,predict1 := data_add$seasonal + data_add$trend + model_fitted] 

model_fitted_2 <- leopar$random - residuals(modelf)
leopar <- cbind(leopar, model_fitted_2)
leopar <-leopar[,predictonlyar := data_add$seasonal + data_add$trend + model_fitted_2] 


ggplot(leopar, aes(x=event_date)) + 
  geom_line(aes(y = sold_count, color = "Sales")) + 
  geom_line(aes(y = predict1, color = "Prediction with ARIMA + Regressor")) + 
  geom_line(aes(y = predictonlyar, color = "Prediction with ARIMA only")) + 
  ggtitle("Actual vs Predicted Data")
## Warning: Removed 6 row(s) containing missing values (geom_path).

## Warning: Removed 6 row(s) containing missing values (geom_path).

Except for the peak points, the predictions seem to catch the trend well. As we can see from the plot, both models yield pretty similar results.

Let's add the variable with the second-highest correlation, visit_count_lag.

reg_matrix = cbind(leopar$basket_count_lag, leopar$visit_count_lag) # can add more if any other regressors exist

model2 <- arima(leopar$random, order = c(5,0,4), xreg = reg_matrix)
summary(model2)
## 
## Call:
## arima(x = leopar$random, order = c(5, 0, 4), xreg = reg_matrix)
## 
## Coefficients:
##          ar1      ar2      ar3     ar4      ar5      ma1     ma2     ma3
##       0.1907  -0.3915  -0.4173  0.5104  -0.3844  -0.3121  0.0542  0.1162
## s.e.  0.0737   0.1435   0.0799  0.1112   0.0478   0.0695  0.1511  0.0820
##           ma4  intercept  reg_matrix1  reg_matrix2
##       -0.8583    -0.0099       0.0019       -1e-04
## s.e.   0.1372     0.0031       0.0008        0e+00
## 
## sigma^2 estimated as 77.44:  log likelihood = -1429.81,  aic = 2885.62
## 
## Training set error measures:
##                     ME     RMSE      MAE      MPE     MAPE      MASE
## Training set 0.1108262 8.799832 3.973089 140.9859 380.9705 0.6376486
##                      ACF1
## Training set 0.0003948911

The performance does not improve (the AIC is essentially unchanged), so we will continue with one regressor. To make predictions and evaluate model performance, we use the accuracy function from the lectures.

accu=function(actual,forecast){
  n=length(actual)
  error=actual-forecast
  mean=mean(actual)
  sd=sd(actual)
  CV=sd/mean                      # coefficient of variation
  FBias=sum(error)/sum(actual)    # forecast bias (positive = under-forecast)
  MAPE=sum(abs(error/actual))/n   # mean absolute percentage error
  RMSE=sqrt(sum(error^2)/n)
  MAD=sum(abs(error))/n           # mean absolute deviation
  MADP=sum(abs(error))/sum(abs(actual))
  WMAPE=MAD/mean                  # weighted MAPE
  l=data.frame(n,mean,sd,CV,FBias,MAPE,RMSE,MAD,MADP,WMAPE)
  return(l)
}
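A quick sanity check of the function on hand-computable toy numbers:

```r
# Errors are 2 and -4, so MAD = (2+4)/2 = 3, MAPE = (0.2+0.2)/2 = 0.2,
# and WMAPE = MAD / mean(actual) = 3/15 = 0.2.
accu(actual = c(10, 20), forecast = c(8, 24))
```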

We need to separate the test and training data. We will use the last 7 days for prediction. To compare the models, we will evaluate both ARIMA with a regressor and ARIMA alone.

ARIMA with regressor:

train_start=as.Date('2021-01-23')
test_start=as.Date('2021-06-24')
test_end=as.Date('2021-06-30')

test_dates=seq(test_start,test_end,by='day')

for(i in 1:length(test_dates)){
    
    current_date=test_dates[i]-2
    
    past_data <- leopar[event_date<=current_date,]
    
    leopar_ts <- ts(past_data$sold_count, frequency = 7)  
    leopar_decomposed <- decompose(leopar_ts, type="additive")
    model <- arima(leopar_decomposed$random,order = c(5,0,4),xreg = past_data$basket_count_lag)
    
    forecasted=predict(model,n.ahead = 2,newxreg = leopar[event_date==test_dates[i],basket_count_lag])
    leopar[nrow(leopar)-length(test_dates)+i-2, Model_reg := forecasted$pred[2]+leopar_decomposed$seasonal[(nrow(leopar)-length(test_dates)+i-2)%%7+7]+leopar_decomposed$trend[max(which(!is.na(leopar_decomposed$trend)))]]
  }
  m_with_reg<-accu(leopar$sold_count[(nrow(leopar)-1-length(test_dates)):(nrow(leopar)-2)],(leopar$Model_reg[(nrow(leopar)-1-length(test_dates)):(nrow(leopar)-2)]))  

Arima without regressor:

for(i in 1:length(test_dates)){
    
    current_date=test_dates[i]-2
    
    past_data <- leopar[event_date<=current_date,]
    
    leopar_ts <- ts(past_data$sold_count, frequency = 7)  
    leopar_decomposed <- decompose(leopar_ts, type="additive")
    model <- arima(leopar_decomposed$random,order = c(5,0,4))
    
    forecasted=predict(model,n.ahead = 2)
    leopar[nrow(leopar)-length(test_dates)+i-2, Model_nolag := forecasted$pred[2]+leopar_decomposed$seasonal[(nrow(leopar)-length(test_dates)+i-2)%%7+7]+leopar_decomposed$trend[max(which(!is.na(leopar_decomposed$trend)))]]
  }
  m_only_arima<-accu(leopar$sold_count[(nrow(leopar)-1-length(test_dates)):(nrow(leopar)-2)],(leopar$Model_nolag[(nrow(leopar)-1-length(test_dates)):(nrow(leopar)-2)]))  
rbind(m_with_reg,m_only_arima)
##   n mean       sd        CV      FBias      MAPE     RMSE      MAD      MADP
## 1 7   27 8.485281 0.3142697 -0.2953161 0.4192115 11.12198 9.559140 0.3540422
## 2 7   27 8.485281 0.3142697 -0.2910258 0.4130253 11.02719 9.380608 0.3474299
##       WMAPE
## 1 0.3540422
## 2 0.3474299

When we compare the results, the ARIMA without regressors seems to have performed slightly better; however, the difference is not significant.

9 Black Bikini

The black bikini is the next product. To analyze it, we first extract its data.

data_siyah <- raw_data[product_content_id == 32737302] 

To understand the trend, the time series graph of the number of sold items should be examined.

ggplot(data_siyah, aes(x=event_date)) + 
  geom_line(aes(y = sold_count), color = "red") + ggtitle("Trend of Black Bikini Sales") + xlab("Date") + ylab("Sales")

There are many missing points in the data. As we do not have a full year of data, we cannot check for monthly seasonality. We will check for day-of-week seasonality and examine the autocorrelation plot.

acf(data_siyah$sold_count, main= "Daily Autocorrelation")

siyah <- data_siyah

The plot indicates a trend, but no specific lag stands out. We do not omit the NA's, so as not to lose any information.

Although we did not see a peak exactly at lag seven, a day-of-week pattern is plausible, so we will decompose with a weekly frequency. As the variance changes over time, we will first try a multiplicative decomposition.
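As a reminder of what the multiplicative model asserts, the observed series factors into trend × seasonal × random. A quick check on a simulated weekly series (simulated numbers, not the bikini data):

```r
# Sanity check of the multiplicative identity x = trend * seasonal * random
set.seed(1)
n <- 70
trend_part  <- seq(50, 120, length.out = n)
season_part <- rep(c(1.2, 1.1, 1.0, 0.9, 0.8, 0.95, 1.05), length.out = n)
x <- ts(trend_part * season_part * exp(rnorm(n, sd = 0.02)), frequency = 7)
dec   <- decompose(x, type = "multiplicative")
recon <- dec$trend * dec$seasonal * dec$random  # NA at the edges (moving average)
max(abs(recon - x), na.rm = TRUE)               # essentially zero
```

The additive model used later replaces the products with sums.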

siyahts <- ts(siyah$sold_count,freq=7)
data_mult<-decompose(siyahts,type="multiplicative")
random=data_mult$random
plot(data_mult)

The trend increases as summer arrives, but there is a sharp decrease in June due to the smaller sizes going out of stock. The detrended part of the data has many peaks, which may cause non-stationarity. However, except for the peaks, the mean and variance seem stable, so it looks stationary.

In this graph it is hard to identify the day-of-week seasonality, so we take a closer look.

plot(data_mult$seasonal[1:7], xlab = "Day", ylab = "Multiplier of Day", main = "Seasonal Component of Multiplicative Decomposition")

To interpret the multiplicative decomposition, we examine the day multipliers. To be able to compare them with the data, we also examine the mean sales for each day. Tuesdays have the maximum sales, and the seasonal component reflects this. However, although the mean is also very high on Wednesdays, the Wednesday multiplier is low.

Higher sales early in the week are normal, as people want delivery before the weekend.

mean_sold=siyah[,list(mean_sold = mean(sold_count,na.rm=T)), by="wday"]
mean_sold
##    wday mean_sold
## 1:  Mon  14.01724
## 2:  Tue  16.55172
## 3:  Wed  16.29310
## 4:  Thu  15.25862
## 5:  Fri  13.64912
## 6:  Sat  12.43860
## 7:  Sun  15.05263

To check the stationarity of the random component, we use the KPSS test:

unt_test=ur.kpss(data_mult$random) 
summary(unt_test)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 4 lags. 
## 
## Value of test-statistic is: 0.0722 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The random component is stationary: the test statistic 0.0722 is below even the 10% critical value of 0.347, so the p-value is larger than 10% and we fail to reject the null hypothesis of stationarity.
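The decision rule can be made explicit with the numbers printed above (these are the critical values of the type-"mu" KPSS test):

```r
# KPSS decision rule: reject stationarity only if the statistic
# exceeds the critical value at the chosen significance level
kpss_stat <- 0.0722   # test statistic from the output above
crit <- c("10pct" = 0.347, "5pct" = 0.463, "2.5pct" = 0.574, "1pct" = 0.739)
kpss_stat < crit      # TRUE at every level: fail to reject stationarity
```

With urca, the same quantities can presumably be read programmatically from the returned S4 object's slots (e.g. `unt_test@teststat` and `unt_test@cval`) rather than by eye.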

Since the multipliers of the multiplicative model were not satisfying, we will also check the additive decomposition.

siyahts <- ts(siyah$sold_count,freq=7)
data_add<-decompose(siyahts,type="additive")
random=data_add$random
plot(data_add)

The trend is very similar to that of the multiplicative version. In the detrended part, the outliers seem to have disappeared; the remainder of the additive model looks more stationary.

In this graph it is hard to identify the day-of-week seasonality, so we take a closer look.

plot(data_add$seasonal[1:7], xlab = "Day", ylab = "Additive Component of Day", main = "Seasonal Component of Additive Decomposition")

mean_sold=siyah[,list(m_sold = mean(sold_count,na.rm=T)), by="wday"]
mean_sold
##    wday   m_sold
## 1:  Mon 14.01724
## 2:  Tue 16.55172
## 3:  Wed 16.29310
## 4:  Thu 15.25862
## 5:  Fri 13.64912
## 6:  Sat 12.43860
## 7:  Sun 15.05263

In the additive model, the daily seasonal components are quite similar in shape to the mean sales of each day. The high sales on Tuesdays and Wednesdays correspond to high seasonal components.

To check the stationarity of the random component, we use the KPSS test again:

unt_test=ur.kpss(data_add$random) 
summary(unt_test)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0063 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The random component is stationary: the test statistic 0.0063 is below even the 10% critical value of 0.347, so we fail to reject the null hypothesis of stationarity. This statistic is also smaller than that of the multiplicative decomposition, so we will continue the analysis with the additive model.

acf(random, na.action = na.pass, main= "Detrend's Autocorrelation")

pacf(random, na.action = na.pass, main= "Detrend's Partial Autocorrelation")

The autocorrelation plot decays, and the partial autocorrelations are significant up to lag 6, so AR(6) will be tried first.

model <- arima(random, order=c(6,0,0))
print(model)
## 
## Call:
## arima(x = random, order = c(6, 0, 0))
## 
## Coefficients:
##          ar1      ar2      ar3      ar4     ar5      ar6  intercept
##       0.0369  -0.3348  -0.4273  -0.1462  -0.208  -0.3156     0.0034
## s.e.  0.0477   0.0465   0.0490   0.0492   0.047   0.0487     0.1135
## 
## sigma^2 estimated as 29:  log likelihood = -1232.44,  aic = 2480.87

The autocorrelation plot, in turn, shows its largest spikes up to lag 3, so let's also try MA(3).

modelf <- arima(random, order=c(0,0,3))
print(modelf)
## 
## Call:
## arima(x = random, order = c(0, 0, 3))
## 
## Coefficients:
##           ma1      ma2      ma3  intercept
##       -0.0715  -0.4262  -0.5023    -0.0043
## s.e.   0.0452   0.0376   0.0484     0.0054
## 
## sigma^2 estimated as 26.69:  log likelihood = -1217.92,  aic = 2445.84

As its AIC is lower than the first model's, this model is better.

We also try their combination and its neighbours to decide which model to use.

model <- arima(random, order=c(6,0,3))
print(model)
## 
## Call:
## arima(x = random, order = c(6, 0, 3))
## 
## Coefficients:
##           ar1      ar2     ar3      ar4      ar5      ar6     ma1      ma2
##       -0.5865  -0.0903  0.0261  -0.4205  -0.1071  -0.1649  0.4962  -0.5314
## s.e.   0.0533   0.0582  0.0576   0.0570   0.0580   0.0536  0.0259   0.0147
##           ma3  intercept
##       -0.9649    -0.0044
## s.e.   0.0298     0.0031
## 
## sigma^2 estimated as 23.33:  log likelihood = -1192.87,  aic = 2407.74

Although the AIC is somewhat lower, the combined model adds many parameters for the gain; we will continue with the more parsimonious MA(3) and try its neighbours:

model <- arima(random, order=c(0,0,4))
print(model)
## 
## Call:
## arima(x = random, order = c(0, 0, 4))
## 
## Coefficients:
##           ma1      ma2      ma3     ma4  intercept
##       -0.1057  -0.4504  -0.5138  0.0699    -0.0042
## s.e.   0.0558   0.0459   0.0475  0.0699     0.0051
## 
## sigma^2 estimated as 26.61:  log likelihood = -1217.41,  aic = 2446.81
model <- arima(random, order=c(0,0,2))
print(model)
## 
## Call:
## arima(x = random, order = c(0, 0, 2))
## 
## Coefficients:
##           ma1      ma2  intercept
##       -0.3481  -0.6519    -0.0044
## s.e.   0.0323   0.0319     0.0040
## 
## sigma^2 estimated as 31.67:  log likelihood = -1251.98,  aic = 2511.96

Both of them increased the AIC, so we will choose ARIMA(0,0,3).
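The manual neighbour search can also be automated as a small grid search over (p, q) orders compared by AIC; a sketch on a simulated MA(3) series (not the bikini data):

```r
# Fit a grid of ARMA(p, q) orders and pick the lowest AIC
set.seed(7)
x <- arima.sim(model = list(ma = c(-0.1, -0.4, -0.5)), n = 300)
grid <- expand.grid(p = 0:2, q = 1:4)
grid$aic <- mapply(function(p, q) {
  fit <- tryCatch(arima(x, order = c(p, 0, q)), error = function(e) NULL)
  if (is.null(fit)) Inf else AIC(fit)   # skip orders that fail to converge
}, grid$p, grid$q)
grid[which.min(grid$aic), c("p", "q")]  # order with the lowest AIC
```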

To identify possible regressors for the ARIMA model, the correlations with both the random component and sold_count are important.

Then, lagged versions of the variables will be added to the model, since the regressor values must already be known at prediction time.

siyah <- siyah[, random := data_add$random]
numeric_data <- siyah
numeric_data$event_date = NULL
numeric_data$product_content_id = NULL
numeric_data$trend= NULL
numeric_data$month= NULL
numeric_data$wday= NULL
numeric_data$random= as.numeric(numeric_data$random)
numeric_data <- na.omit(numeric_data)
str(numeric_data)
## Classes 'data.table' and 'data.frame':   175 obs. of  12 variables:
##  $ price              : num  60 60 60 60 60.5 ...
##  $ sold_count         : num  33 33 39 44 32 32 21 28 25 27 ...
##  $ visit_count        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ basket_count       : num  150 161 169 186 113 117 133 132 106 119 ...
##  $ favored_count      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ category_sold      : num  722 968 1279 1320 992 ...
##  $ category_visits    : num  1052 1374 1758 1853 1429 ...
##  $ category_basket    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ category_favored   : num  5566 7406 9076 9482 6511 ...
##  $ category_brand_sold: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ty_visits          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ random             : num  -5.501 -3.07 5.878 10.035 0.243 ...
##  - attr(*, ".internal.selfref")=<externalptr>
correl_info = cor(numeric_data)
ggcorrplot(correl_info, hc.order = TRUE, type = "lower",lab = TRUE)

The variables most correlated with the series are basket_count, visit_count and category_sold. To be able to use them in our predictions, we add their lagged versions.

Price does not show a meaningful correlation, as the data do not contain all discount types.

siyah <- siyah[,category_sold_lag := shift(category_sold, 2)]
siyah <- siyah[,basket_count_lag := shift(basket_count, 2)]
siyah <- siyah[,visit_count_lag := shift(visit_count, 2)]
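The lag of two days matches the two-step-ahead forecasts made later: when forecasting day t, we may only use regressor values already observed by day t − 2. A toy example of what data.table::shift produces (made-up numbers):

```r
# shift(x, 2) lags a column by two rows, padding the start with NA
library(data.table)
dt <- data.table(day = 1:5, basket = c(10, 12, 9, 15, 11))
dt[, basket_lag2 := shift(basket, 2)]
dt$basket_lag2
# NA NA 10 12  9
```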

numeric_data <- siyah
numeric_data$event_date = NULL
numeric_data$product_content_id = NULL
numeric_data$trend= NULL
numeric_data$month= NULL
numeric_data$wday= NULL
numeric_data$random= as.numeric(numeric_data$random)
numeric_data <- na.omit(numeric_data)
str(numeric_data)
## Classes 'data.table' and 'data.frame':   175 obs. of  15 variables:
##  $ price              : num  60 60 60 60 60.5 ...
##  $ sold_count         : num  33 33 39 44 32 32 21 28 25 27 ...
##  $ visit_count        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ basket_count       : num  150 161 169 186 113 117 133 132 106 119 ...
##  $ favored_count      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ category_sold      : num  722 968 1279 1320 992 ...
##  $ category_visits    : num  1052 1374 1758 1853 1429 ...
##  $ category_basket    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ category_favored   : num  5566 7406 9076 9482 6511 ...
##  $ category_brand_sold: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ty_visits          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ random             : num  -5.501 -3.07 5.878 10.035 0.243 ...
##  $ category_sold_lag  : num  1223 848 722 968 1279 ...
##  $ basket_count_lag   : num  211 141 150 161 169 186 113 117 133 132 ...
##  $ visit_count_lag    : num  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, ".internal.selfref")=<externalptr>
correl_info = cor(numeric_data)
ggcorrplot(correl_info, hc.order = TRUE, type = "lower",lab = TRUE)

The correlations of the lagged variables are all very close to each other. Let's add them one by one to see whether they improve the model.

reg_matrix=cbind(siyah$basket_count_lag) # can add more if any other regressors exist
model1 <- arima(siyah$random, order = c(0,0,3), xreg = reg_matrix)
summary(model1)
## 
## Call:
## arima(x = siyah$random, order = c(0, 0, 3), xreg = reg_matrix)
## 
## Coefficients:
##           ma1      ma2      ma3  intercept  reg_matrix
##       -0.0719  -0.4263  -0.5018    -0.0091       1e-04
## s.e.   0.0453   0.0376   0.0484     0.0088       2e-04
## 
## sigma^2 estimated as 26.66:  log likelihood = -1217.67,  aic = 2447.35
## 
## Training set error measures:
##                       ME     RMSE      MAE      MPE    MAPE     MASE
## Training set -0.03134201 5.163451 2.973833 63.64687 99.8179 0.616558
##                     ACF1
## Training set -0.02027125

The AIC does not improve when we add the lagged variable.

model_fitted <- siyah$random - residuals(model1)
siyah <- cbind(siyah, data_add$seasonal, data_add$trend, model_fitted)
siyah <-siyah[,predictarima := data_add$seasonal + data_add$trend + model_fitted  ] 

model_fitted_2 <- siyah$random - residuals(modelf)
siyah <- cbind(siyah, model_fitted_2)
siyah <-siyah[,predictonlyar := data_add$seasonal + data_add$trend + model_fitted_2] 

ggplot(siyah, aes(x=event_date)) + 
  geom_line(aes(y = sold_count), color = "red") + 
  geom_line(aes(y = predictarima), color="blue") + ggtitle("Trend of Black Bikini Sales") + xlab("Date") + ylab("Sales")
## Warning: Removed 6 row(s) containing missing values (geom_path).

Adding the second:

reg_matrix=cbind(siyah$visit_count_lag)

model2 <- arima(siyah$random, order = c(0,0,3), xreg = reg_matrix)
summary(model2)
## 
## Call:
## arima(x = siyah$random, order = c(0, 0, 3), xreg = reg_matrix)
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs üretimi
##           ma1      ma2      ma3  intercept  reg_matrix
##       -0.0719  -0.4263  -0.5018    -0.0092           0
## s.e.   0.0453   0.0376   0.0484        NaN         NaN
## 
## sigma^2 estimated as 26.66:  log likelihood = -1217.63,  aic = 2447.26
## 
## Training set error measures:
##                       ME     RMSE      MAE      MPE     MAPE      MASE
## Training set -0.02003195 5.162873 2.971803 63.58047 99.35088 0.6161371
##                     ACF1
## Training set -0.02029881

The AIC has not decreased, so the model did not improve; let's try the last lagged variable to see whether it helps.

reg_matrix=cbind(siyah$category_sold_lag)

model3 <- arima(siyah$random, order = c(0,0,3), xreg = reg_matrix)
summary(model3)
## 
## Call:
## arima(x = siyah$random, order = c(0, 0, 3), xreg = reg_matrix)
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs üretimi
##           ma1      ma2      ma3  intercept  reg_matrix
##       -0.0715  -0.4263  -0.5022    -0.0094           0
## s.e.   0.0453   0.0375   0.0484        NaN         NaN
## 
## sigma^2 estimated as 26.67:  log likelihood = -1217.75,  aic = 2447.5
## 
## Training set error measures:
##                       ME     RMSE      MAE      MPE     MAPE      MASE
## Training set -0.06101857 5.164487 2.985519 63.99305 101.3311 0.6189809
##                     ACF1
## Training set -0.02013689

The AIC is higher here as well; adding any of the regressors lowers the performance. Among them, the lagged visit count gives the lowest AIC, so we will make predictions both with it and with the plain ARIMA and compare the results.

Testing with regressor:

train_start=as.Date('2021-01-23')
test_start=as.Date('2021-06-24')
test_end=as.Date('2021-06-30')

test_dates=seq(test_start,test_end,by='day')

for(i in 1:length(test_dates)){
    
    current_date=test_dates[i]-2
    
    past_data <- siyah[event_date<=current_date,]
    
    siyah_ts <- ts(past_data$sold_count, frequency = 7)  
    siyah_decomposed <- decompose(siyah_ts, type="additive")
    model <- arima(siyah_decomposed$random,order = c(0,0,3),xreg = past_data$visit_count_lag)
    
    forecasted=predict(model,n.ahead = 2,newxreg = siyah[event_date==test_dates[i],visit_count_lag])
    siyah[nrow(siyah)-length(test_dates)+i-2, Model_reg := forecasted$pred[2]+siyah_decomposed$seasonal[(nrow(siyah)-length(test_dates)+i-2)%%7+7]+siyah_decomposed$trend[max(which(!is.na(siyah_decomposed$trend)))]]
  }
  m_with_lag<-accu(siyah$sold_count[(nrow(siyah)-1-length(test_dates)):(nrow(siyah)-2)],(siyah$Model_reg[(nrow(siyah)-1-length(test_dates)):(nrow(siyah)-2)]))  

Testing with only arima:

for(i in 1:length(test_dates)){
    
    current_date=test_dates[i]-2
    
    past_data <- siyah[event_date<=current_date,]
    
    siyah_ts <- ts(past_data$sold_count, frequency = 7)  
    siyah_decomposed <- decompose(siyah_ts, type="additive")
    model <- arima(siyah_decomposed$random,order = c(0,0,3))
    
    forecasted=predict(model,n.ahead = 2)
    siyah[nrow(siyah)-length(test_dates)+i-2, Model_no_reg := forecasted$pred[2]+siyah_decomposed$seasonal[(nrow(siyah)-length(test_dates)+i-2)%%7+7]+siyah_decomposed$trend[max(which(!is.na(siyah_decomposed$trend)))]]
  }
  m_without_lag<-accu(siyah$sold_count[(nrow(siyah)-1-length(test_dates)):(nrow(siyah)-2)],(siyah$Model_no_reg[(nrow(siyah)-1-length(test_dates)):(nrow(siyah)-2)]))  

The version without the lagged regressor performed slightly better; it is accepted, as its AIC was also lower.
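The reconstruction step used inside the loops above — random-component forecast plus the matching seasonal value plus the last available trend value — can be sketched on its own. A toy weekly series stands in for the sales data; the seasonal-indexing convention is the illustrative part:

```r
# Forecast on the original scale = random forecast + seasonal + frozen trend
set.seed(3)
n <- 140
x <- ts(100 + 10 * sin(2 * pi * (1:n) / 7) + rnorm(n), frequency = 7)
dec <- decompose(x, type = "additive")
fit <- arima(dec$random, order = c(1, 0, 0))       # arima() tolerates the edge NAs
rand_fc    <- predict(fit, n.ahead = 2)$pred[2]    # random part for day n + 2
season_fc  <- dec$seasonal[(n + 2 - 1) %% 7 + 1]   # seasonal value for day n + 2
last_trend <- tail(na.omit(dec$trend), 1)          # trend frozen at last estimate
forecast_value <- rand_fc + season_fc + last_trend
```

Freezing the trend at its last estimate is a simplification; the moving-average trend is undefined for the final half-window of days.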

10 Leggings

The next product is the leggings:

data_tayt <- raw_data[product_content_id == 31515569]

To get a better understanding, we plot the time series of the leggings sales.

ggplot(data_tayt, aes(x=event_date)) + 
  geom_line(aes(y = sold_count), color = "red") + ggtitle("Trend of Leggings Sales") + xlab("Date") + ylab("Sales")

We will check for day-of-week seasonality; we do not have enough data to check for a monthly pattern. We expect a 7-day frequency; to see the seasonality, we examine the autocorrelations.

acf(data_tayt$sold_count, main= "Daily Autocorrelation")

pacf(data_tayt$sold_count, main= "Daily Autocorrelation")

There is a peak at lag 16, which we will try as a decomposition frequency. We will also check the 7-day decomposition and compare the two by the stationarity of their random components.

In order not to lose information, we do not delete the NA's (the days with 0 sales).

Although there was no peak exactly at lag seven, a day-of-week pattern is plausible, so we will decompose with a weekly frequency. As the variance changes over time, we will use a multiplicative decomposition. Since there is a peak at lag 16, we will also try a decomposition with frequency 16.

tayt <- data_tayt
taytts <- ts(tayt$sold_count,freq=7)
data_mult<-decompose(taytts,type="multiplicative")
random=data_mult$random
plot(data_mult)

Sales peak in November: leggings may be preferred in autumn, and low prices at that time could also contribute. The trend part of the decomposition seems to capture the overall trend.

When we examine the random part, we see a stationary series with roughly constant mean and variance.

In this graph it is hard to identify the day-of-week seasonality, so we take a closer look.

plot(data_mult$seasonal[1:7], xlab = "Day", ylab = "Multiplier of Day", main = "Seasonal Component of Multiplicative Decomposition")

To compare the seasonality multipliers with the mean sales of each day, we compute the daily means:

mean_sold=tayt[,list(mean_sold = mean(sold_count,na.rm=T)), by="wday"]
mean_sold
##    wday mean_sold
## 1:  Mon  811.1379
## 2:  Tue  940.0000
## 3:  Wed 1034.5345
## 4:  Thu 1039.1897
## 5:  Fri  696.5789
## 6:  Sat  580.2807
## 7:  Sun  608.4035

The seasonal component seems to reflect the actual mean values.

To check the stationarity of the random part, we use the KPSS test:

unt_test=ur.kpss(data_mult$random) 
summary(unt_test)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.1184 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The random component is stationary: the test statistic 0.1184 is below even the 10% critical value of 0.347, so the p-value is larger than 10% and we fail to reject the null hypothesis of stationarity.

We will also check the additive decomposition to see whether it performs better.

taytts <- ts(tayt$sold_count,freq=7)
data_add<-decompose(taytts,type="additive")
random=data_add$random
plot(data_add)

The trend part of the decomposition seems to capture the overall trend well.

When we examine the random part, we see a stationary series with roughly constant mean and variance. Compared to the multiplicative version, the mean is smaller and the outlier peak is visible in the random part.

In this graph it is hard to identify the day-of-week seasonality, so we take a closer look.

plot(data_add$seasonal[1:7], xlab = "Day", ylab = "Additive Component of Day", main = "Seasonal Component of Additive Decomposition")

The seasonal component seems to reflect the mean values of the days and is quite similar to the multiplicative decomposition results.

To check the stationarity of the random part, we use the KPSS test:

unt_test=ur.kpss(data_add$random) 
summary(unt_test)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0063 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The random component is stationary: the test statistic 0.0063 is below even the 10% critical value of 0.347, so we fail to reject the null hypothesis of stationarity. This is also smaller than the multiplicative decomposition's statistic, so we will continue with the additive decomposition.

As the autocorrelation plot peaks at lag 16, we also examine a decomposition with frequency 16. We will use the additive type, as it performed better for the 7-day decomposition.

taytts <- ts(tayt$sold_count,freq=16)
data_add_2<-decompose(taytts,type="additive")
plot(data_add_2)

Around the peak, the trend seems to catch the decrease with a delay. The random part looks similar to that of the other model.

In this graph it is hard to identify the 16-day seasonal pattern, so we take a closer look.

plot(data_add_2$seasonal[1:16], xlab = "Day", ylab = "Additive Component of Day", main = "Seasonal Component of Additive Decomposition")

The seasonal component peaks around the second week's Tuesdays and Wednesdays; however, with a 16-day period the days of the week are no longer aligned across cycles in this decomposition.

To check the stationarity of the random part:

unt_test=ur.kpss(data_add_2$random) 
summary(unt_test)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0056 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The random part is also stationary; however, we will continue the analysis with the 7-day decomposition.

acf(random, na.action = na.pass, main= "Detrend's Autocorrelation")

pacf(random, na.action = na.pass, main= "Detrend's Partial Autocorrelation")

The autocorrelation plot decays, and the partial autocorrelations are significant up to lag 4, so AR(4) will be tried first.

model <- arima(random, order=c(4,0,0))
print(model)
## 
## Call:
## arima(x = random, order = c(4, 0, 0))
## 
## Coefficients:
##          ar1      ar2      ar3      ar4  intercept
##       0.2668  -0.1144  -0.1979  -0.2597    -0.0114
## s.e.  0.0485   0.0494   0.0493   0.0484    18.6062
## 
## sigma^2 estimated as 232696:  log likelihood = -3016.65,  aic = 6045.3

The autocorrelation plot, in turn, shows significant spikes up to lag 4, so let's also try MA(4).

model <- arima(random, order=c(0,0,4))
print(model)
## 
## Call:
## arima(x = random, order = c(0, 0, 4))
## 
## Coefficients:
##          ma1      ma2      ma3      ma4  intercept
##       0.0451  -0.3005  -0.4005  -0.3441     0.0780
## s.e.  0.0461   0.0457   0.0431   0.0512     0.6099
## 
## sigma^2 estimated as 202322:  log likelihood = -2991.02,  aic = 5994.03

As its AIC is lower than the first model's, this model is better.

We also try their combination and its neighbours to decide which model to use.

model <- arima(random, order=c(4,0,4))
print(model)
## 
## Call:
## arima(x = random, order = c(4, 0, 4))
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs üretimi
##          ar1     ar2      ar3     ar4     ma1      ma2      ma3     ma4
##       0.2376  0.1942  -0.0391  -0.401  -0.244  -0.5604  -0.3554  0.1598
## s.e.     NaN  0.5659      NaN     NaN     NaN   0.5899      NaN     NaN
##       intercept
##          0.0598
## s.e.     0.3341
## 
## sigma^2 estimated as 183334:  log likelihood = -2972.15,  aic = 5964.3

This is the lowest AIC so far, thus the best model yet. Trying the neighbours:

model <- arima(random, order=c(3,0,4))
print(model)
## 
## Call:
## arima(x = random, order = c(3, 0, 4))
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs üretimi
##          ar1     ar2      ar3      ma1      ma2     ma3     ma4  intercept
##       0.6368  0.6023  -0.6433  -0.7536  -1.0818  0.5753  0.2601     0.0704
## s.e.     NaN     NaN      NaN      NaN      NaN     NaN  0.0237     0.0767
## 
## sigma^2 estimated as 171534:  log likelihood = -2960.91,  aic = 5939.82

Reducing the AR order to 3 improved the performance. Also checking ARIMA(4,0,3):

modelf <- arima(random, order=c(4,0,3))
print(modelf)
## 
## Call:
## arima(x = random, order = c(4, 0, 3))
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs üretimi
##          ar1     ar2      ar3     ar4      ma1      ma2     ma3  intercept
##       0.9752  0.3597  -0.8759  0.2881  -1.0612  -0.7895  0.8508      0.074
## s.e.     NaN     NaN      NaN  0.0329      NaN      NaN     NaN      0.073
## 
## sigma^2 estimated as 168325:  log likelihood = -2957.1,  aic = 5932.2

This is the lowest AIC, so we will continue with the (4,0,3) model.

To identify possible regressors for the ARIMA model, the correlations with both the random component and sold_count are important.

Then, lagged versions of the variables will be added to the model, since the regressor values must already be known at prediction time.

tayt <- tayt[, random := data_add$random]
numeric_data <- tayt
numeric_data$event_date = NULL
numeric_data$product_content_id = NULL
numeric_data$trend= NULL
numeric_data$month= NULL
numeric_data$wday= NULL
numeric_data$random= as.numeric(numeric_data$random)
numeric_data <- na.omit(numeric_data)
str(numeric_data)
## Classes 'data.table' and 'data.frame':   397 obs. of  12 variables:
##  $ price              : num  45 45 45 45 45 ...
##  $ sold_count         : num  366 1188 1162 895 649 ...
##  $ visit_count        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ basket_count       : num  2293 6379 5640 4200 3084 ...
##  $ favored_count      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ category_sold      : num  836 1581 1530 1353 868 ...
##  $ category_visits    : num  7436 7736 9176 8896 7154 ...
##  $ category_basket    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ category_favored   : num  35389 36277 42223 38858 30893 ...
##  $ category_brand_sold: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ty_visits          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ random             : num  -563 604 665 316 -162 ...
##  - attr(*, ".internal.selfref")=<externalptr>
correl_info = cor(numeric_data)
ggcorrplot(correl_info, hc.order = TRUE, type = "lower",lab = TRUE)

The variables most correlated with the series are basket_count and category_sold. To use them in the regression, we add their lagged versions.

tayt <- tayt[,category_sold_lag := shift(category_sold, 2)]
tayt <- tayt[,basket_count_lag := shift(basket_count, 2)]


numeric_data <- tayt
numeric_data$event_date = NULL
numeric_data$product_content_id = NULL
numeric_data$trend= NULL
numeric_data$month= NULL
numeric_data$wday= NULL
numeric_data$random= as.numeric(numeric_data$random)
numeric_data <- na.omit(numeric_data)
str(numeric_data)
## Classes 'data.table' and 'data.frame':   397 obs. of  14 variables:
##  $ price              : num  45 45 45 45 45 ...
##  $ sold_count         : num  366 1188 1162 895 649 ...
##  $ visit_count        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ basket_count       : num  2293 6379 5640 4200 3084 ...
##  $ favored_count      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ category_sold      : num  836 1581 1530 1353 868 ...
##  $ category_visits    : num  7436 7736 9176 8896 7154 ...
##  $ category_basket    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ category_favored   : num  35389 36277 42223 38858 30893 ...
##  $ category_brand_sold: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ty_visits          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ random             : num  -563 604 665 316 -162 ...
##  $ category_sold_lag  : num  953 777 836 1581 1530 ...
##  $ basket_count_lag   : num  2226 1754 2293 6379 5640 ...
##  - attr(*, ".internal.selfref")=<externalptr>
correl_info = cor(numeric_data)
ggcorrplot(correl_info, hc.order = TRUE, type = "lower",lab = TRUE)

The lagged basket_count has the highest correlation, so we start with it.

reg_matrix=cbind(tayt$basket_count_lag) # can add more if any other regressors exist
model1 <- arima(tayt$random, order = c(4,0,3), xreg = reg_matrix)
summary(model1)
## 
## Call:
## arima(x = random, order = c(3, 0, 4))
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs üretimi
##          ar1     ar2      ar3      ma1      ma2     ma3     ma4  intercept
##       0.6368  0.6023  -0.6433  -0.7536  -1.0818  0.5753  0.2601     0.0704
## s.e.     NaN     NaN      NaN      NaN      NaN     NaN  0.0237     0.0767
## 
## sigma^2 estimated as 171534:  log likelihood = -2960.91,  aic = 5939.82
## 
## Training set error measures:
##                     ME     RMSE      MAE      MPE    MAPE      MASE       ACF1
## Training set -4.901652 414.1668 227.7179 97.47823 154.436 0.7217732 0.02486137

The AIC remains roughly the same; the model neither improved nor worsened.

We add the other potential regressor to the model.

reg_matrix=cbind(tayt$category_sold_lag)

model1 <- arima(tayt$random, order = c(4,0,3), xreg = reg_matrix)
summary(model1)
## 
## Call:
## arima(x = tayt$random, order = c(4, 0, 3), xreg = reg_matrix)
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs üretimi
##          ar1     ar2      ar3     ar4      ma1      ma2     ma3  intercept
##       0.9173  0.4814  -0.9728  0.3166  -1.0111  -0.9191  0.9305     -0.844
## s.e.     NaN     NaN      NaN     NaN      NaN      NaN     NaN        NaN
##       reg_matrix
##            4e-04
## s.e.         NaN
## 
## sigma^2 estimated as 165884:  log likelihood = -2954.36,  aic = 5928.72
## 
## Training set error measures:
##                    ME     RMSE      MAE      MPE     MAPE      MASE        ACF1
## Training set 8.982585 407.2884 226.2653 103.9454 162.7711 0.7171692 -0.01527941

The AIC is lower, so we keep the lagged category_sold variable as a regressor.

To plot the ARIMA fit with and without the lagged regressor on the same graph:

# fitted values of the ARIMA model with the lagged regressor
model_fitted <- tayt$random - residuals(model1)
tayt <- cbind(tayt, data_add$seasonal, data_add$trend, model_fitted)
tayt <- tayt[, predictarima := data_add$seasonal + data_add$trend + model_fitted]

# fitted values of the ARIMA model without regressors (modelf, fitted earlier)
model_fitted_2 <- tayt$random - residuals(modelf)
tayt <- cbind(tayt, model_fitted_2)
tayt <- tayt[, predictarima_no_reg := data_add$seasonal + data_add$trend + model_fitted_2]

ggplot(tayt, aes(x = event_date)) +
  geom_line(aes(y = sold_count, color = "sold_count")) +
  geom_line(aes(y = predictarima, color = "arima_with_lag")) +
  geom_line(aes(y = predictarima_no_reg, color = "only_arima_prediction")) +
  ggtitle("Trend of Leggings Sales") + xlab("Date") + ylab("Sales")
## Warning: Removed 6 row(s) containing missing values (geom_path).

The two models yield very similar results, and both capture the trend except at the peak points.

To evaluate the models, we produce rolling two-day-ahead forecasts over the test period:

train_start = as.Date('2021-01-23')
test_start = as.Date('2021-06-23')
test_end = as.Date('2021-06-30')

test_dates = seq(test_start, test_end, by = 'day')

for (i in 1:length(test_dates)) {
  # forecasts are made two days in advance
  current_date = test_dates[i] - 2

  past_data <- tayt[event_date <= current_date, ]

  # re-decompose and re-fit the model on the data available at forecast time
  tayt_ts <- ts(past_data$sold_count, frequency = 7)
  tayt_decomposed <- decompose(tayt_ts, type = "additive")
  model <- arima(tayt_decomposed$random, order = c(4, 0, 3), xreg = past_data$basket_count_lag)

  # forecast the random component, then add back the seasonal term and the last available trend value
  forecasted = predict(model, n.ahead = 2, newxreg = tayt[event_date == test_dates[i], basket_count_lag])
  tayt[nrow(tayt) - length(test_dates) + i - 2,
       Model_reg := forecasted$pred[2] +
         tayt_decomposed$seasonal[(nrow(tayt) - length(test_dates) + i - 2) %% 7 + 7] +
         tayt_decomposed$trend[max(which(!is.na(tayt_decomposed$trend)))]]
}
m_with_7 <- accu(tayt$sold_count[(nrow(tayt) - 1 - length(test_dates)):(nrow(tayt) - 2)],
                 tayt$Model_reg[(nrow(tayt) - 1 - length(test_dates)):(nrow(tayt) - 2)])

For the model with no external regressors:

for (i in 1:length(test_dates)) {
  current_date = test_dates[i] - 2

  past_data <- tayt[event_date <= current_date, ]

  tayt_ts <- ts(past_data$sold_count, frequency = 7)
  tayt_decomposed <- decompose(tayt_ts, type = "additive")
  model <- arima(tayt_decomposed$random, order = c(4, 0, 3))

  forecasted = predict(model, n.ahead = 2)
  tayt[nrow(tayt) - length(test_dates) + i - 2,
       Model_noreg := forecasted$pred[2] +
         tayt_decomposed$seasonal[(nrow(tayt) - length(test_dates) + i - 2) %% 7 + 7] +
         tayt_decomposed$trend[max(which(!is.na(tayt_decomposed$trend)))]]
}
m_with_no_reg <- accu(tayt$sold_count[(nrow(tayt) - 1 - length(test_dates)):(nrow(tayt) - 2)],
                      tayt$Model_noreg[(nrow(tayt) - 1 - length(test_dates)):(nrow(tayt) - 2)])
rbind(m_with_7,m_with_no_reg)
##   n mean       sd        CV       FBias      MAPE     RMSE      MAD      MADP
## 1 8  257 42.77182 0.1664273 -0.19691640 0.6108043 157.2306 151.4369 0.5892486
## 2 8  257 42.77182 0.1664273 -0.09805438 0.5866824 153.3859 146.6330 0.5705563
##       WMAPE
## 1 0.5892486
## 2 0.5705563

ARIMA with no external regressors performed better: WMAPE, MAPE, and MAD are all lower for the model without regressors.
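The accu() helper used above is defined earlier in the report. For reference, a minimal sketch consistent with the columns in the table could look like the following; the exact formulas are assumed from the column names and standard conventions, not taken from the original accu():

```r
# Hypothetical re-implementation of an accuracy helper; formulas are assumed,
# not copied from the accu() defined earlier in the report.
accu_sketch <- function(actual, forecast) {
  n <- length(actual)
  error <- actual - forecast
  data.frame(
    n     = n,
    mean  = mean(actual),
    sd    = sd(actual),
    CV    = sd(actual) / mean(actual),
    FBias = sum(error) / sum(actual),
    MAPE  = mean(abs(error / actual)),
    RMSE  = sqrt(mean(error^2)),
    MAD   = mean(abs(error)),
    WMAPE = sum(abs(error)) / sum(actual)  # equals MAD / mean(actual)
  )
}
accu_sketch(c(100, 120, 90), c(110, 115, 100))
```

Note that with these definitions WMAPE and MAD / mean(actual) coincide, which matches the identical MADP and WMAPE columns in the table above.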

11 Conclusion

In this homework, we first searched for the possible seasonalities that each product may exhibit. As there are few data points, we mostly examined day-of-week patterns and tried to explain their possible causes, such as weekends and people's shopping tendencies. After decomposing each series, we built AR, MA and ARIMA models, added external regressors to these ARIMA models, and finally tested them over a one-week period using rolling two-day-ahead forecasts.